
fix: thread thinking_blocks through history so extended thinking survives tool turns#88

Open
triscacezar-droid wants to merge 1 commit into huggingface:main from triscacezar-droid:fix/thread-thinking-blocks-through-history

Conversation

@triscacezar-droid
Contributor

Fixes #87.

Problem

ml-intern requests extended thinking via thinking={"type": "adaptive"} (agent/core/llm_params.py:153), but the assistant messages it stores in history never carry the thinking_blocks / reasoning_content that the model produced. On every tool-continuation turn LiteLLM logs:

LiteLLM:WARNING: Dropping 'thinking' param because the last assistant message
with tool_calls has no thinking_blocks. The model won't use extended thinking
for this turn.

This silently disables extended thinking on essentially every turn after the first tool call — effectively turning Opus into non-thinking-mode Opus for the bulk of an agent's work.

Root cause

In agent/core/agent_loop.py:

  1. _call_llm_streaming and _call_llm_non_streaming don't capture thinking_blocks / reasoning_content from responses.
  2. LLMResult has no fields for them.
  3. All three Message(role="assistant", ...) construction sites in the loop drop the thinking state.

Fix

  • LLMResult: new optional thinking_blocks + reasoning_content fields.
  • Streaming path: collect raw chunks during iteration, then reassemble with litellm.stream_chunk_builder(chunks) after the stream ends and pull thinking state off the reassembled message. Wrapped in try/except so unfamiliar providers fall back silently.
  • Non-streaming path: read them directly off response.choices[0].message.
  • Main loop: attach both fields to every Message(role="assistant", ...) construction (truncation site, no-tool-calls terminal site, with-tool-calls site).
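A minimal sketch of the capture logic described above, under stated assumptions: the trimmed-down `LLMResult` and the helper name `extract_thinking` are illustrative stand-ins, not the PR's exact code, and `SimpleNamespace` stands in for a provider response message.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional

@dataclass
class LLMResult:
    # Trimmed-down stand-in for the real class; the two new fields
    # default to None so non-thinking providers are unaffected.
    content: str = ""
    thinking_blocks: Optional[list] = None
    reasoning_content: Optional[str] = None

def extract_thinking(message: Any) -> tuple[Optional[list], Optional[str]]:
    """Pull thinking state off a reassembled or non-streaming message.

    getattr with a default means providers that never set these fields
    simply yield None; any other surprise degrades silently.
    """
    try:
        blocks = getattr(message, "thinking_blocks", None) or None
        reasoning = getattr(message, "reasoning_content", None) or None
        return blocks, reasoning
    except Exception:
        return None, None

# Anthropic-style message carrying thinking state.
anthropic_msg = SimpleNamespace(
    thinking_blocks=[{"type": "thinking", "thinking": "plan..."}],
    reasoning_content="plan...",
)
blocks, reasoning = extract_thinking(anthropic_msg)
result = LLMResult(content="done", thinking_blocks=blocks, reasoning_content=reasoning)
```

On the streaming path the same extraction would run on the message reassembled by `litellm.stream_chunk_builder(chunks)`; on the non-streaming path it runs directly on `response.choices[0].message`.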

Verification

Tested locally with anthropic/claude-opus-4-6 + adaptive thinking:

  • stream_chunk_builder correctly reassembles 1 thinking block + full reasoning_content from streamed chunks.
  • LiteLLM warning disappears on turns where the model actually produced thinking.
  • Trivial prompts where adaptive thinking legitimately skips thinking still show the warning, because there's genuinely nothing to attach — semantically correct.

Compatibility

No behavior change for non-thinking models — all new fields default to None and are only populated when present on the response. Safe for HF router / OpenAI / any provider stream_chunk_builder doesn't know how to handle (falls back to None).

@akseljoonas
Collaborator

@claude please review this PR. Focus on correctness, security, LiteLLM/Bedrock routing impact, and whether it's safe to merge against current main. Keep it concise and prefer inline comments where it matters.

@claude

claude Bot commented Apr 24, 2026

Claude finished @akseljoonas's task in 2m 13s


PR Review: fix/thread-thinking-blocks-through-history

  • Gather context and read changed files
  • Review correctness of thinking_blocks threading
  • Assess LiteLLM/Bedrock routing impact
  • Check security surface
  • Verify safety against current main

Verdict: SAFE TO MERGE · Risk: LOW


Correctness

The root cause diagnosis and fix are accurate. LiteLLM's validation checks that the last assistant message with tool_calls carries thinking_blocks before forwarding the thinking parameter — if it doesn't, extended thinking is silently dropped for the remainder of the conversation. The fix correctly captures thinking_blocks / reasoning_content on both paths and attaches them to all three assistant message construction sites in run_agent.

Streaming path (agent_loop.py:399–407): Uses stream_chunk_builder(chunks) post-loop to reassemble the full message, then pulls thinking state with getattr. Wrapped in try/except so any failure degrades silently. One mild inefficiency: stream_chunk_builder re-accumulates content/tool calls that the loop already accumulated — it's called only for the thinking fields. Not a bug, just redundant work.

Non-streaming path (agent_loop.py:465–466): Reads directly from response.choices[0].message. Clean.

or None idiom (getattr(msg, "thinking_blocks", None) or None): Correctly coerces empty lists [] and empty strings "" to None, so messages never get thinking_blocks=[] attached. Intentional and correct.
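The coercion is easy to check in isolation (`coerce` is just a name for the pattern, not a function in the PR):

```python
# How the `or None` coercion behaves: falsy values (empty list, empty
# string) normalize to None, so history messages never carry
# thinking_blocks=[] or reasoning_content="".
def coerce(value):
    return value or None

assert coerce([]) is None
assert coerce("") is None
assert coerce([{"type": "thinking"}]) == [{"type": "thinking"}]
```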

All three message construction sites updated: truncation path (L644–647), no-tool-calls terminal path (L704–707), and with-tool-calls path (L737–740). Consistent.

Context compaction: The compact() call in context_manager/manager.py:391 creates a new Message(role="assistant", content=summary) without thinking_blocks, but that's fine — compacted-away messages don't need thinking preserved. The recent_messages tail (which includes the current turn's assistant message with tool calls) is kept as-is, so thinking_blocks survives compaction on the messages that matter.


Potential Concern: Memory overhead per streaming call

# agent_loop.py:350
chunks: list = []
# ... 
chunks.append(chunk)   # every chunk stored
# ... post-loop:
reassembled = stream_chunk_builder(chunks)

For Anthropic extended-thinking responses, thinking_blocks are streamed as large chunks before the content deltas. This means every streaming call now holds two copies of the thinking content in memory — the raw chunks plus the reassembled object — until _call_llm_streaming returns. chunks is local and GC'd on return, so there's no leak. For very long thinking responses (Opus at high effort), peak memory per call roughly doubles. Acceptable for now, but worth noting if memory becomes a concern.


LiteLLM / Bedrock / HF Router Impact

No impact. For all non-Anthropic providers:

  • getattr(message, "thinking_blocks", None) returns None
  • The if thinking_blocks: guards mean nothing is ever attached to messages for these providers
  • The try/except around stream_chunk_builder protects against provider-specific chunk formats (Bedrock uses reasoningContent, not thinking_blocks)
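The no-op behavior for such providers can be sketched as follows. `attach_thinking` is a hypothetical condensation of the three construction sites, using a plain dict in place of LiteLLM's `Message`:

```python
from types import SimpleNamespace

def attach_thinking(assistant_msg: dict, provider_message) -> dict:
    # getattr returns None when the provider never set these fields,
    # and the truthiness guards keep empty values off the message too.
    thinking_blocks = getattr(provider_message, "thinking_blocks", None) or None
    reasoning_content = getattr(provider_message, "reasoning_content", None) or None
    if thinking_blocks:
        assistant_msg["thinking_blocks"] = thinking_blocks
    if reasoning_content:
        assistant_msg["reasoning_content"] = reasoning_content
    return assistant_msg

# Bedrock-style message: reasoning arrives under a different attribute
# (reasoningContent), so neither getattr matches and nothing is attached.
bedrock_msg = SimpleNamespace(reasoningContent={"text": "..."})
msg = attach_thinking({"role": "assistant", "content": "ok"}, bedrock_msg)
```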

The research tool (research_tool.py:385–391) deliberately strips reasoning_content when echoing assistant messages back for HF-router calls — that's intentional and orthogonal to this fix. Not broken here.


Security

No new surface. thinking_blocks flows only from the Anthropic API → in-memory session state → back to the Anthropic API on continuation turns. No user-controlled input, no serialization to new external systems, no path injection risk.


Minor Notes

  1. No tests: A regression test with a mock that returns thinking_blocks on the first call and verifies the second call's messages contain them would give confidence against future regressions. Not a blocker for merge.
  2. Attribute mutation style: assistant_msg.thinking_blocks = thinking_blocks works because LiteLLM's Message Pydantic model allows extra fields. Constructor kwargs would be more robust long-term if LiteLLM ever tightens field validation, but the current approach is consistent with how LiteLLM itself attaches these fields internally.
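The two construction styles from note 2, sketched against a plain stand-in class (the real LiteLLM `Message` is a Pydantic model that allows extra fields; this stand-in only mirrors that shape):

```python
class Message:
    """Plain stand-in for LiteLLM's Message, which permits extra fields."""
    def __init__(self, role: str, content: str = "", **extra):
        self.role = role
        self.content = content
        for key, value in extra.items():
            setattr(self, key, value)

# Style the PR uses: mutate after construction. Works as long as the
# model class tolerates attribute assignment for undeclared fields.
msg = Message(role="assistant", content="done")
msg.thinking_blocks = [{"type": "thinking", "thinking": "..."}]

# Constructor-kwargs style suggested in the review: more robust if field
# validation is ever tightened, since extras pass through __init__.
msg2 = Message(
    role="assistant",
    content="done",
    thinking_blocks=[{"type": "thinking", "thinking": "..."}],
)
```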

fix: thread thinking_blocks through history so extended thinking survives tool turns

Fixes huggingface#87.

ml-intern requests extended thinking via `thinking={"type": "adaptive"}`
(agent/core/llm_params.py), but assistant messages stored in history
never carry the `thinking_blocks` / `reasoning_content` that the model
produced. On every tool-continuation turn LiteLLM logs:

  Dropping 'thinking' param because the last assistant message with
  tool_calls has no thinking_blocks. The model won't use extended
  thinking for this turn.

…which silently turns Opus into non-thinking Opus for the bulk of the
agent's work — a meaningful reasoning-quality regression.

Changes in agent/core/agent_loop.py:

- `LLMResult`: add `thinking_blocks` and `reasoning_content` fields.
- `_call_llm_streaming`: collect raw chunks during iteration and, after
  the stream ends, pass them to `litellm.stream_chunk_builder` to
  reassemble the full message — then pull `thinking_blocks` /
  `reasoning_content` off it. Wrapped in try/except so unfamiliar
  providers just fall back to no thinking reassembly.
- `_call_llm_non_streaming`: read them directly off
  `response.choices[0].message`.
- Main loop: attach both fields to every `Message(role="assistant", ...)`
  construction (truncation-hint site, no-tool-calls terminal site, and
  the critical with-tool-calls site).

Verified locally with anthropic/claude-opus-4-6 + adaptive thinking:
`stream_chunk_builder` reassembles 1 thinking block + full
`reasoning_content` from streamed chunks; the warning disappears on
turns where the model actually produced thinking. Trivial prompts where
adaptive thinking legitimately skips still show the warning because
there is genuinely nothing to attach — semantically correct.

No behavior change for non-thinking models (all new fields default to
None and are only set when present on the response).
@triscacezar-droid force-pushed the fix/thread-thinking-blocks-through-history branch from bec8dc4 to a85d363 on April 26, 2026 at 07:16
@triscacezar-droid
Contributor Author

Rebased onto current main — MERGEABLE again. Conflicts were trivial: main added usage tracking to LLMResult / the streaming setup, this PR added thinking_blocks / reasoning_content / chunks in the same regions. Both sides preserved, no semantic changes from the version you reviewed. The Claude bot review last week came back "safe to merge, low risk".

@akseljoonas would appreciate another look when you have a moment 🙏 — without this fix, every ml-intern run on Anthropic models silently loses extended thinking after the first tool call.



Development

Successfully merging this pull request may close these issues.

Extended thinking silently dropped on tool-continuation turns (thinking_blocks not threaded through history)
