Summary
When using LiteLLM models with ADK's planning features (e.g., `PlanReActPlanner`) in streaming mode, planning and reasoning content appears twice in responses when tool calls are made:
- First during streaming, as individual text chunks (lines 1288-1296)
- Again in the aggregated tool-call message, with `content=text` (line 1352)

This violates OpenAI/LiteLLM conventions and creates unnecessary duplication in conversation history.
Environment
- ADK Version: 1.19.0
- Affected File: `lite_llm.py`
- Python Version: 3.11+
- Models Affected: All non-Gemini models accessed via LiteLLM (Claude, GPT, etc.) when using planning workflows
- Feature: Streaming responses with tool calls
Expected Behavior
According to OpenAI/LiteLLM API specifications:
- When a message contains only tool calls (no user-facing answer text), the `content` field should be `None`
- Planning/reasoning text like `<PLANNING>I need to search...</PLANNING>` is internal reasoning, not the final answer
- Tool-call messages should follow this structure:

```python
{
    "role": "assistant",
    "content": None,  # No content for tool-only messages
    "tool_calls": [...],
}
```
Actual Behavior
The aggregated response at lines 1348-1359 sets `content=text`, including all accumulated planning/reasoning text:

```python
aggregated_llm_response_with_tool_call = (
    _message_to_generate_content_response(
        ChatCompletionAssistantMessage(
            role="assistant",
            content=text,  # Includes planning text, causing duplication
            tool_calls=tool_calls,
        ),
        model_version=part.model,
        thought_parts=list(reasoning_parts)
        if reasoning_parts
        else None,
    )
)
```
Result: the planning text appears twice:
- During streaming (lines 1288-1296): `<PLANNING>I need to search...</PLANNING>` streamed chunk-by-chunk
- In the aggregated message (line 1352): the same text included in the `content` field
Impact
1. Content Duplication
- The frontend receives the same planning text twice
- Requires additional filtering logic in application code
- Poor user experience if not handled
2. API Convention Violation
- OpenAI/Claude/GPT APIs expect `content=None` for tool-only messages
- The current implementation sends `content=<planning_text>`, which is semantically incorrect
- Tool-call messages should not contain answer text in `content`
3. Conversation History Bloat
- Planning text is unnecessarily stored in the message `content` field
- It is already preserved separately in `thought_parts` (line 1357)
- Increases storage and memory overhead
4. Semantic Confusion
- `content=text` implies "the model generated answer text AND called tools"
- Reality: the model only generated internal reasoning before calling tools
- Misrepresents the actual interaction flow (see the sketch after this list)
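
To make the convention violation concrete, here is a hedged sketch of the stored assistant message today versus what the OpenAI/LiteLLM convention calls for. The planning text and tool-call payload are illustrative stand-ins, not taken from real logs:

```python
# What the aggregated tool-call turn currently looks like (illustrative):
current_message = {
    "role": "assistant",
    "content": "<PLANNING>I need to search for weather</PLANNING>",  # duplicated reasoning
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "search", "arguments": '{"query": "Boston weather"}'},
    }],
}

# What the OpenAI/LiteLLM convention expects for a tool-only turn:
expected_message = {
    "role": "assistant",
    "content": None,  # reasoning belongs in thought_parts, not in content
    "tool_calls": current_message["tool_calls"],
}
```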
Steps to Reproduce
1. Create an agent with a LiteLLM model:

```python
from google.adk.agents import Agent
from google.adk.models import LiteLlm
from google.adk.planners import PlanReActPlanner

agent = Agent(
    model=LiteLlm(model="vertex_ai/claude-3-5-sonnet-v2@20241022"),
    planner=PlanReActPlanner(),
    tools=[search_tool, ...],
)
```

2. Enable streaming and send a query requiring tools:

```python
async for response in agent.run_streaming("What's the weather in Boston?"):
    print(response.content)
```

3. Observe in the logs:
- Planning text like `<PLANNING>I need to search for weather</PLANNING>` streamed as chunks
- The same planning text appears again in the aggregated response's `content` field when tool calls are made

4. Check the conversation history:
- The tool-call message has `content="<PLANNING>..."` instead of `content=None`
Root Cause
Lines 1268-1303: all text chunks (including planning) are accumulated into the `text` variable:

```python
text = ""
...
elif isinstance(chunk, TextChunk):
    text += chunk.text  # Accumulates planning/reasoning text
    yield _message_to_generate_content_response(...)  # Already streamed to user
```

Line 1352: the accumulated text is included again in the aggregated message:

```python
content=text,  # Duplicates already-streamed planning text
```

Line 1357: the planning text is already preserved separately:

```python
thought_parts=list(reasoning_parts) if reasoning_parts else None,
```
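
The duplication pattern can be reproduced in isolation. The following is a minimal, self-contained sketch of the accumulate-then-aggregate shape of the streaming loop; `TextChunk` and the message dicts here are simplified stand-ins for the actual ADK/LiteLLM types:

```python
from dataclasses import dataclass

@dataclass
class TextChunk:
    text: str

def stream_with_aggregation(chunks, tool_calls):
    """Mimics the streaming loop: yield each chunk, then an aggregated message."""
    text = ""
    for chunk in chunks:
        text += chunk.text
        yield {"role": "assistant", "content": chunk.text}  # streamed to the user
    # The aggregated tool-call message re-includes the accumulated text:
    yield {"role": "assistant", "content": text, "tool_calls": tool_calls}

planning = [TextChunk("<PLANNING>"), TextChunk("I need to search"), TextChunk("</PLANNING>")]
messages = list(stream_with_aggregation(planning, tool_calls=[{"name": "search"}]))

streamed_text = "".join(m["content"] for m in messages[:-1])
assert messages[-1]["content"] == streamed_text  # same planning text delivered twice
```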
Proposed Fix
Change line 1352 to set `content=None` for tool-only messages:

```python
aggregated_llm_response_with_tool_call = (
    _message_to_generate_content_response(
        ChatCompletionAssistantMessage(
            role="assistant",
            content=None,  # ✅ FIX: No duplication, follows OpenAI/LiteLLM spec
            tool_calls=tool_calls,
        ),
        model_version=part.model,
        thought_parts=list(reasoning_parts)
        if reasoning_parts
        else None,
    )
)
```
Comparison with Non-Streaming
The non-streaming path (around line 770) correctly handles this by:
- Creating a single response with complete tool-call information
- Leaving no opportunity for duplication (there is no incremental streaming)

The streaming path (lines 1268-1400) has the duplication issue because:
- Text chunks are yielded immediately during streaming
- The same text is then included again in the final aggregated message

The fix brings the streaming behavior in line with the non-streaming path and with API conventions.
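
A regression check for the fix could look roughly like the following. This is a hedged pytest-style sketch against the simplified message shape used above, not the actual ADK test harness; `aggregate_tool_call_message` is a hypothetical stand-in for the aggregation at lines 1348-1359:

```python
def aggregate_tool_call_message(text, tool_calls, reasoning_parts):
    """Stand-in for the aggregation site with the fix applied.

    `text` is deliberately not placed in `content`; reasoning is kept
    only in `thought_parts`.
    """
    return {
        "role": "assistant",
        "content": None,  # tool-only message: no answer text
        "tool_calls": tool_calls,
        "thought_parts": list(reasoning_parts) if reasoning_parts else None,
    }

def test_tool_only_message_has_no_content():
    message = aggregate_tool_call_message(
        text="<PLANNING>I need to search</PLANNING>",
        tool_calls=[{"name": "search"}],
        reasoning_parts=["<PLANNING>I need to search</PLANNING>"],
    )
    assert message["content"] is None  # no duplicated planning text
    assert message["thought_parts"]    # reasoning still preserved separately
```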
Additional Context
- This issue specifically affects planning workflows where models generate reasoning text before calling tools
- It does not affect simple tool-call scenarios without planning text
- The `thought_parts` parameter already exists to preserve reasoning separately from message content
- Frontend applications using ADK planning currently need to implement workarounds to deduplicate content (as sketched below)
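
For completeness, here is a rough sketch of the kind of client-side workaround applications resort to today: suppress the aggregated message's content when it merely repeats text that was already streamed. The dict-based event shape is a simplified assumption, not the exact ADK response API:

```python
def deduplicate_stream(events):
    """Drop aggregated content that was already delivered as streamed chunks."""
    streamed = []
    for event in events:
        if event.get("tool_calls") and event.get("content"):
            # Tool-call message: suppress content if it repeats streamed text.
            if event["content"] == "".join(streamed):
                event = {**event, "content": None}
        elif event.get("content"):
            streamed.append(event["content"])
        yield event
```

With the proposed fix, this filtering layer becomes unnecessary.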
Recommended Fix (Summary)
```python
# Lines 1348-1359
aggregated_llm_response_with_tool_call = (
    _message_to_generate_content_response(
        ChatCompletionAssistantMessage(
            role="assistant",
            content=None,  # ✅ FIX: Avoid duplication, follow OpenAI spec
            tool_calls=tool_calls,
        ),
        model_version=part.model,
        thought_parts=list(reasoning_parts)
        if reasoning_parts
        else None,
    )
)
```
This single-line change eliminates content duplication, aligns with API standards, and maintains semantic correctness for tool-call messages in streaming responses.