
fix: avoid content leak when generation ends inside thinking block #2001

Open

mlpy0 wants to merge 1 commit into exo-explore:main from mlpy0:fix/thinking-finish-reason-leak

Conversation

mlpy0 (Contributor) commented Apr 28, 2026

When generation ends while the parser is still inside <think>...</think>, the chunk that carries finish_reason was unconditionally stamped is_thinking=False in parse_thinking_models. Its text was then routed to the content channel, leaking 1-8 characters of the last thinking token into content while the rest of the thinking output stayed in reasoning_content.

This change preserves is_thinking on the final token's text via a separate chunk (with finish_reason=None) and emits an empty-text content chunk to carry finish_reason. Consumers that read only content and finish_reason still see the terminating delta; consumers that read reasoning_content get the full thinking output.
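
For illustration, the split can be sketched like this; GenerationChunk and its fields are stand-ins for the parser's chunk type, not exo's actual classes:

from dataclasses import dataclass, replace
from typing import Iterator, Optional

@dataclass
class GenerationChunk:  # illustrative stand-in for the parser's chunk type
    text: str
    is_thinking: bool
    finish_reason: Optional[str] = None

def split_final_thinking_chunk(chunk: GenerationChunk) -> Iterator[GenerationChunk]:
    # Generation stopped while still inside <think>: keep the last token's text
    # on the thinking channel and carry finish_reason on a separate empty chunk.
    if chunk.finish_reason is not None and chunk.is_thinking and chunk.text:
        yield replace(chunk, finish_reason=None)
        yield GenerationChunk(text="", is_thinking=False, finish_reason=chunk.finish_reason)
    else:
        yield chunk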

Reproduction (Qwen3.6-27B-8bit, max_tokens=10, default thinking):

  • before: usage.completion_tokens=10, content=' '
  • after: usage.completion_tokens=10, content=None, reasoning_content populated

Tests: strengthens TestThinkingModelsFinishReason.test_finish_reason_during_thinking to pin the leak, and adds test_finish_reason_during_thinking_no_content_leak covering the starts_in_thinking=True path that hit the bug in production.
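
Expressed against the sketch above rather than the real test harness, the added test asserts roughly the following:

def test_finish_reason_during_thinking_no_content_leak():
    # Final token arrives while still inside <think>; generation stops at max_tokens.
    final = GenerationChunk(text=" 42", is_thinking=True, finish_reason="length")
    first, last = list(split_final_thinking_chunk(final))
    assert first.is_thinking and first.finish_reason is None   # text stays on the thinking channel
    assert last.text == "" and last.finish_reason == "length"  # empty content chunk carries finish_reason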

Evanev7 (Member) commented May 1, 2026

Sounds like an issue in a downstream client or in our adapters. Can you tell us what tools you were using that caused this output, and which endpoint (ollama, responses, chat-completions, or claude)?

mlpy0 (Contributor, Author) commented May 1, 2026

Endpoint: chat-completions (POST /v1/chat/completions, stream=false).

Client: small Python script using urllib.request from the stdlib. No SDK, no wrapper, no adapter on top. Just reads choices[0].message.content out of the JSON.

Model: mlx-community/Qwen3.6-27B-8bit (also reproduced intermittently on Qwen3.6-35B-A3B-8bit).

Sample request that reproduced it:

{
  "model": "mlx-community/Qwen3.6-27B-8bit",
  "messages": [{"role": "user", "content": "Count from 1 to 100, comma separated."}],
  "max_tokens": 10,
  "temperature": 0.0,
  "stream": false
}

Response had completion_tokens=10, finish_reason="stop", content=" ". Sweeping max_tokens up to 1000 keeps the same shape: completion_tokens tracks max_tokens, and content stays at 1-8 characters of whatever the last mid-thinking token decoded to.

Since the script does nothing beyond reading the content field out of the JSON, no downstream parser is in play on this side.
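
A script of that shape looks roughly like this (host and port are assumed, not the exact values used):

import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:52415/v1/chat/completions",  # assumed host/port
    data=json.dumps({
        "model": "mlx-community/Qwen3.6-27B-8bit",
        "messages": [{"role": "user", "content": "Count from 1 to 100, comma separated."}],
        "max_tokens": 10,
        "temperature": 0.0,
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

message = body["choices"][0]["message"]
print(body["usage"]["completion_tokens"])  # 10
print(repr(message.get("content")))        # ' ' before the fix, None after
print(message.get("reasoning_content"))    # populated after the fix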
