fix: thread thinking_blocks through history so extended thinking survives tool turns (#88)
Conversation
@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

Claude finished @akseljoonas's task in 2m 13s. PR Review:
Force-pushed bec8dc4 to a85d363
Rebased onto current main — @akseljoonas would appreciate another look when you have a moment 🙏 — without this fix, every ml-intern run on Anthropic models silently loses extended thinking after the first tool call.
Fixes #87.
## Problem

ml-intern requests extended thinking via `thinking={"type": "adaptive"}` (agent/core/llm_params.py:153), but the assistant messages it stores in history never carry the `thinking_blocks` / `reasoning_content` that the model produced. On every tool-continuation turn LiteLLM logs:

> Dropping 'thinking' param because the last assistant message with tool_calls has no thinking_blocks. The model won't use extended thinking for this turn.

This silently disables extended thinking for ~every turn after the first tool call — effectively turning Opus into non-thinking-mode Opus for the bulk of an agent's work.
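Concretely, the history entry LiteLLM checks must look roughly like this (a minimal sketch; the field names follow LiteLLM's Anthropic-style thinking blocks, while the tool call and all values are made up for illustration):

```python
# Illustrative shape only: what an assistant history entry must carry for
# LiteLLM to keep the `thinking` param on the next turn. The tool call and
# values are invented; only the field names matter.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "toolu_01",  # made-up id
            "type": "function",
            "function": {"name": "read_file", "arguments": '{"path": "a.py"}'},
        }
    ],
    # Without these two fields, LiteLLM drops `thinking` on the follow-up call:
    "thinking_blocks": [
        {"type": "thinking", "thinking": "…model reasoning…", "signature": "…"}
    ],
    "reasoning_content": "…model reasoning…",
}
```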
## Root cause

In `agent/core/agent_loop.py`:

- `_call_llm_streaming` and `_call_llm_non_streaming` don't capture `thinking_blocks` / `reasoning_content` from responses.
- `LLMResult` has no fields for them.
- `Message(role="assistant", ...)` construction sites in the loop drop the thinking state.

## Fix
- `LLMResult`: new optional `thinking_blocks` + `reasoning_content` fields.
- `_call_llm_streaming`: collect raw chunks during iteration, call `litellm.stream_chunk_builder(chunks)` after the stream ends, and pull thinking state off the reassembled message. Wrapped in try/except so unfamiliar providers fall back silently.
- `_call_llm_non_streaming`: read them directly off `response.choices[0].message`.
- Main loop: attach both fields to every `Message(role="assistant", ...)` construction (truncation site, no-tool-calls terminal site, with-tool-calls site).

## Verification
Tested locally with `anthropic/claude-opus-4-6` + adaptive thinking:

- `stream_chunk_builder` correctly reassembles 1 thinking block + full `reasoning_content` from streamed chunks.
- The warning disappears on turns where the model actually produced thinking; trivial prompts where adaptive thinking legitimately skips still show it, because there is genuinely nothing to attach.

## Compatibility
No behavior change for non-thinking models — all new fields default to `None` and are only populated when present on the response. Safe for the HF router / OpenAI / any provider `stream_chunk_builder` doesn't know how to handle (falls back to `None`).