
fix: thread thinking_blocks through history so extended thinking survives tool turns#88

Open
triscacezar-droid wants to merge 1 commit into huggingface:main from triscacezar-droid:fix/thread-thinking-blocks-through-history

Conversation

@triscacezar-droid
Contributor

Fixes #87.

Problem

ml-intern requests extended thinking via thinking={"type": "adaptive"} (agent/core/llm_params.py:153), but the assistant messages it stores in history never carry the thinking_blocks / reasoning_content that the model produced. On every tool-continuation turn LiteLLM logs:

LiteLLM:WARNING: Dropping 'thinking' param because the last assistant message
with tool_calls has no thinking_blocks. The model won't use extended thinking
for this turn.

This silently disables extended thinking on essentially every turn after the first tool call — effectively turning Opus into non-thinking-mode Opus for the bulk of an agent's work.

Root cause

In agent/core/agent_loop.py:

  1. _call_llm_streaming and _call_llm_non_streaming don't capture thinking_blocks / reasoning_content from responses.
  2. LLMResult has no fields for them.
  3. All three Message(role="assistant", ...) construction sites in the loop drop the thinking state.

Fix

  • LLMResult: new optional thinking_blocks + reasoning_content fields.
  • Streaming path: collect raw chunks during iteration, then reassemble with litellm.stream_chunk_builder(chunks) after the stream ends and pull thinking state off the reassembled message. Wrapped in try/except so unfamiliar providers fall back silently.
  • Non-streaming path: read them directly off response.choices[0].message.
  • Main loop: attach both fields to every Message(role="assistant", ...) construction (truncation site, no-tool-calls terminal site, with-tool-calls site).
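A minimal sketch of the capture logic described above, under stated assumptions: the trimmed-down `LLMResult` and the helper name `extract_thinking` are illustrative stand-ins, not the PR's exact code, and `SimpleNamespace` stands in for a provider response message.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional

@dataclass
class LLMResult:
    # Trimmed-down stand-in for the real class; the two new fields
    # default to None so non-thinking providers are unaffected.
    content: str = ""
    thinking_blocks: Optional[list] = None
    reasoning_content: Optional[str] = None

def extract_thinking(message: Any) -> tuple[Optional[list], Optional[str]]:
    """Pull thinking state off a reassembled or non-streaming message.

    getattr with a default means providers that never set these fields
    simply yield None; any other surprise degrades silently.
    """
    try:
        blocks = getattr(message, "thinking_blocks", None) or None
        reasoning = getattr(message, "reasoning_content", None) or None
        return blocks, reasoning
    except Exception:
        return None, None

# Anthropic-style message carrying thinking state.
anthropic_msg = SimpleNamespace(
    thinking_blocks=[{"type": "thinking", "thinking": "plan..."}],
    reasoning_content="plan...",
)
blocks, reasoning = extract_thinking(anthropic_msg)
result = LLMResult(content="done", thinking_blocks=blocks, reasoning_content=reasoning)
```

On the streaming path the same extraction would run on the message reassembled by `litellm.stream_chunk_builder(chunks)`; on the non-streaming path it runs directly on `response.choices[0].message`.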

Verification

Tested locally with anthropic/claude-opus-4-6 + adaptive thinking:

  • stream_chunk_builder correctly reassembles 1 thinking block + full reasoning_content from streamed chunks.
  • LiteLLM warning disappears on turns where the model actually produced thinking.
  • Trivial prompts where adaptive thinking legitimately skips thinking still show the warning, because there's genuinely nothing to attach — semantically correct.

Compatibility

No behavior change for non-thinking models — all new fields default to None and are only populated when present on the response. Safe for HF router / OpenAI / any provider stream_chunk_builder doesn't know how to handle (falls back to None).

@akseljoonas
Collaborator

@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

@claude

claude Bot commented Apr 24, 2026

Claude finished @akseljoonas's task in 2m 13s


PR Review: fix/thread-thinking-blocks-through-history

  • Gather context and read changed files
  • Review correctness of thinking_blocks threading
  • Assess LiteLLM/Bedrock routing impact
  • Check security surface
  • Verify safety against current main

Verdict: SAFE TO MERGE · Risk: LOW


Correctness

The root cause diagnosis and fix are accurate. LiteLLM's validation checks that the last assistant message with tool_calls carries thinking_blocks before forwarding the thinking parameter — if it doesn't, extended thinking is silently dropped for the remainder of the conversation. The fix correctly captures thinking_blocks / reasoning_content on both paths and attaches them to all three assistant message construction sites in run_agent.

Streaming path (agent_loop.py:399–407): Uses stream_chunk_builder(chunks) post-loop to reassemble the full message, then pulls thinking state with getattr. Wrapped in try/except so any failure degrades silently. One mild inefficiency: stream_chunk_builder re-accumulates content/tool calls that the loop already accumulated — it's called only for the thinking fields. Not a bug, just redundant work.

Non-streaming path (agent_loop.py:465–466): Reads directly from response.choices[0].message. Clean.

or None idiom (getattr(msg, "thinking_blocks", None) or None): Correctly coerces empty lists [] and empty strings "" to None, so messages never get thinking_blocks=[] attached. Intentional and correct.
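The coercion is easy to check in isolation (`coerce` is just a name for the pattern, not a function in the PR):

```python
# How the `or None` coercion behaves: falsy values (empty list, empty
# string) normalize to None, so history messages never carry
# thinking_blocks=[] or reasoning_content="".
def coerce(value):
    return value or None

assert coerce([]) is None
assert coerce("") is None
assert coerce([{"type": "thinking"}]) == [{"type": "thinking"}]
```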

All three message construction sites updated: truncation path (L644–647), no-tool-calls terminal path (L704–707), and with-tool-calls path (L737–740). Consistent.

Context compaction: The compact() call in context_manager/manager.py:391 creates a new Message(role="assistant", content=summary) without thinking_blocks, but that's fine — compacted-away messages don't need thinking preserved. The recent_messages tail (which includes the current turn's assistant message with tool calls) is kept as-is, so thinking_blocks survives compaction on the messages that matter.


Potential Concern: Memory overhead per streaming call

# agent_loop.py:350
chunks: list = []
# ... 
chunks.append(chunk)   # every chunk stored
# ... post-loop:
reassembled = stream_chunk_builder(chunks)

For Anthropic extended-thinking responses, thinking_blocks are streamed as large chunks before the content deltas. This means every streaming call now holds two copies of the thinking content in memory — the raw chunks plus the reassembled object — until _call_llm_streaming returns. chunks is local and GC'd on return, so there's no leak. For very long thinking responses (Opus at high effort), peak memory per call roughly doubles. Acceptable for now, but worth noting if memory becomes a concern.


LiteLLM / Bedrock / HF Router Impact

No impact. For all non-Anthropic providers:

  • getattr(message, "thinking_blocks", None) returns None
  • The if thinking_blocks: guards mean nothing is ever attached to messages for these providers
  • The try/except around stream_chunk_builder protects against provider-specific chunk formats (Bedrock uses reasoningContent, not thinking_blocks)
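The no-op behavior for such providers can be sketched as follows. `attach_thinking` is a hypothetical condensation of the three construction sites, using a plain dict in place of LiteLLM's `Message`:

```python
from types import SimpleNamespace

def attach_thinking(assistant_msg: dict, provider_message) -> dict:
    # getattr returns None when the provider never set these fields,
    # and the truthiness guards keep empty values off the message too.
    thinking_blocks = getattr(provider_message, "thinking_blocks", None) or None
    reasoning_content = getattr(provider_message, "reasoning_content", None) or None
    if thinking_blocks:
        assistant_msg["thinking_blocks"] = thinking_blocks
    if reasoning_content:
        assistant_msg["reasoning_content"] = reasoning_content
    return assistant_msg

# Bedrock-style message: reasoning arrives under a different attribute
# (reasoningContent), so neither getattr matches and nothing is attached.
bedrock_msg = SimpleNamespace(reasoningContent={"text": "..."})
msg = attach_thinking({"role": "assistant", "content": "ok"}, bedrock_msg)
```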

The research tool (research_tool.py:385–391) deliberately strips reasoning_content when echoing assistant messages back for HF-router calls — that's intentional and orthogonal to this fix. Not broken here.


Security

No new surface. thinking_blocks flows only from the Anthropic API → in-memory session state → back to the Anthropic API on continuation turns. No user-controlled input, no serialization to new external systems, no path injection risk.


Minor Notes

  1. No tests: A regression test with a mock that returns thinking_blocks on the first call and verifies the second call's messages contain them would give confidence against future regressions. Not a blocker for merge.
  2. Attribute mutation style: assistant_msg.thinking_blocks = thinking_blocks works because LiteLLM's Message Pydantic model allows extra fields. Constructor kwargs would be more robust long-term if LiteLLM ever tightens field validation, but the current approach is consistent with how LiteLLM itself attaches these fields internally.
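The two construction styles from note 2, sketched against a plain stand-in class (the real LiteLLM `Message` is a Pydantic model that allows extra fields; this stand-in only mirrors that shape):

```python
class Message:
    """Plain stand-in for LiteLLM's Message, which permits extra fields."""
    def __init__(self, role: str, content: str = "", **extra):
        self.role = role
        self.content = content
        for key, value in extra.items():
            setattr(self, key, value)

# Style the PR uses: mutate after construction. Works as long as the
# model class tolerates attribute assignment for undeclared fields.
msg = Message(role="assistant", content="done")
msg.thinking_blocks = [{"type": "thinking", "thinking": "..."}]

# Constructor-kwargs style suggested in the review: more robust if field
# validation is ever tightened, since extras pass through __init__.
msg2 = Message(
    role="assistant",
    content="done",
    thinking_blocks=[{"type": "thinking", "thinking": "..."}],
)
```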

fix: thread thinking_blocks through history so extended thinking survives tool turns

Fixes huggingface#87.

ml-intern requests extended thinking via `thinking={"type": "adaptive"}`
(agent/core/llm_params.py), but assistant messages stored in history
never carry the `thinking_blocks` / `reasoning_content` that the model
produced. On every tool-continuation turn LiteLLM logs:

  Dropping 'thinking' param because the last assistant message with
  tool_calls has no thinking_blocks. The model won't use extended
  thinking for this turn.

…which silently turns Opus into non-thinking Opus for the bulk of the
agent's work — a meaningful reasoning-quality regression.

Changes in agent/core/agent_loop.py:

- `LLMResult`: add `thinking_blocks` and `reasoning_content` fields.
- `_call_llm_streaming`: collect raw chunks during iteration and, after
  the stream ends, pass them to `litellm.stream_chunk_builder` to
  reassemble the full message — then pull `thinking_blocks` /
  `reasoning_content` off it. Wrapped in try/except so unfamiliar
  providers just fall back to no thinking reassembly.
- `_call_llm_non_streaming`: read them directly off
  `response.choices[0].message`.
- Main loop: attach both fields to every `Message(role="assistant", ...)`
  construction (truncation-hint site, no-tool-calls terminal site, and
  the critical with-tool-calls site).

Verified locally with anthropic/claude-opus-4-6 + adaptive thinking:
`stream_chunk_builder` reassembles 1 thinking block + full
`reasoning_content` from streamed chunks; the warning disappears on
turns where the model actually produced thinking. Trivial prompts where
adaptive thinking legitimately skips still show the warning because
there is genuinely nothing to attach — semantically correct.

No behavior change for non-thinking models (all new fields default to
None and are only set when present on the response).
@triscacezar-droid force-pushed the fix/thread-thinking-blocks-through-history branch from bec8dc4 to a85d363 on April 26, 2026 at 07:16
@triscacezar-droid
Contributor Author

Rebased onto current main — MERGEABLE again. Conflicts were trivial: main added usage tracking to LLMResult / the streaming setup, this PR added thinking_blocks / reasoning_content / chunks in the same regions. Both sides preserved, no semantic changes from the version you reviewed. The Claude bot review last week came back "safe to merge, low risk".

@akseljoonas would appreciate another look when you have a moment 🙏 — without this fix, every ml-intern run on Anthropic models silently loses extended thinking after the first tool call.



Development

Successfully merging this pull request may close these issues.

Extended thinking silently dropped on tool-continuation turns (thinking_blocks not threaded through history)
