fix: avoid content leak when generation ends inside thinking block #2001
mlpy0 wants to merge 1 commit into
Conversation
sounds like an issue in a downstream client or in our adapters. can you tell us what tools you were using that caused this output, and which endpoint (ollama, responses, chat-completions or claude)?
- Endpoint: chat-completions (`POST /v1/chat/completions`, `stream=false`).
- Client: small Python script using `urllib.request` from the stdlib. No SDK, no wrapper, no adapter on top. Just reads `choices[0].message.content` out of the JSON.
- Model: `mlx-community/Qwen3.6-27B-8bit` (also reproduced intermittently on `Qwen3.6-35B-A3B-8bit`).
- Sample request that reproduced it (a minimal version is sketched below): the response had `completion_tokens=10`, `finish_reason="stop"`, `content=" "`. Sweeping `max_tokens` up to 1000 keeps the same shape: `completion_tokens` tracks `max_tokens`, and `content` stays 1 to 8 characters of whatever the last mid-thinking token decoded to.

Since the script does nothing beyond reading the `content` field out of the JSON, no downstream parser is in play on this side.
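A minimal version of that script looks roughly like this; the host, port, and prompt are placeholders for my local setup, the request shape is what matters:

```python
import json
import urllib.request

# Placeholder endpoint; my server listens on localhost:8080.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "mlx-community/Qwen3.6-27B-8bit",
    "messages": [{"role": "user", "content": "Explain why the sky is blue."}],
    "max_tokens": 10,   # small cap so generation stops mid-thinking
    "stream": False,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

choice = body["choices"][0]
print("finish_reason:", choice["finish_reason"])
print("content:", repr(choice["message"].get("content")))
print("completion_tokens:", body["usage"]["completion_tokens"])
```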
When generation ends while the parser is still inside `<think>...</think>`, the chunk that carries `finish_reason` was unconditionally stamped `is_thinking=False` in `parse_thinking_models`. Its text then routed to the content channel, leaking 1-8 characters of the last thinking token into `content` while the rest of the thinking output stayed in `reasoning_content`.

This change preserves `is_thinking` on the final token's text via a separate chunk (with `finish_reason=None`) and emits an empty-text content chunk to carry `finish_reason`. Consumers that read only `content` and `finish_reason` still see the terminating delta; consumers that read `reasoning_content` get the full thinking output.
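A sketch of the chunk-splitting shape described above; the names are illustrative, not the project's actual `parse_thinking_models` types:

```python
from dataclasses import dataclass
from typing import Iterator, Optional

# Illustrative chunk type; field names mirror the description above,
# not the project's real classes.
@dataclass
class Chunk:
    text: str
    is_thinking: bool
    finish_reason: Optional[str] = None

def split_final_thinking_chunk(chunk: Chunk, in_thinking: bool) -> Iterator[Chunk]:
    """If generation ends while still inside <think>...</think>, keep the
    final token's text on the thinking channel and emit a separate empty
    content chunk that carries finish_reason."""
    if chunk.finish_reason is not None and in_thinking:
        # Final token text stays marked as thinking, without finish_reason.
        yield Chunk(text=chunk.text, is_thinking=True, finish_reason=None)
        # Empty content chunk carries the terminating finish_reason.
        yield Chunk(text="", is_thinking=False, finish_reason=chunk.finish_reason)
    else:
        yield chunk
```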
Reproduction (Qwen3.6-27B-8bit, max_tokens=10, default thinking):

- Before: `usage.completion_tokens=10`, `content=' '`
- After: `usage.completion_tokens=10`, `content=None`, `reasoning_content` populated

Tests: strengthens `TestThinkingModelsFinishReason.test_finish_reason_during_thinking` to pin the leak, and adds `test_finish_reason_during_thinking_no_content_leak` covering the `starts_in_thinking=True` path that hit the bug in production.
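For illustration only, the new assertion roughly has this shape against the `Chunk`/`split_final_thinking_chunk` sketch above; the real test exercises `parse_thinking_models` through the project's fixtures:

```python
def test_finish_reason_during_thinking_no_content_leak():
    chunks = list(
        split_final_thinking_chunk(
            Chunk(text="blue", is_thinking=True, finish_reason="length"),
            in_thinking=True,  # corresponds to the starts_in_thinking=True path
        )
    )
    # No thinking text may reach the content channel.
    assert all(c.text == "" for c in chunks if not c.is_thinking)
    # The terminating finish_reason is still delivered on a content chunk.
    assert any(c.finish_reason == "length" and not c.is_thinking for c in chunks)
```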