feat: accumulate function_call events in streaming path #59
@maralbahari This is the function_call accumulation PR we discussed in #51. Ready for review when you get a chance. Covers the full streaming lifecycle: Once this lands, #51's |
## Summary

- Adds `--tools` and `--tool-choice` options to `record_cassette.py` to inject tool definitions and `tool_choice` into requests
- Adds `tool_calls/tools.json` with 8 tool definitions (get_weather, get_time, get_stock_price, search_web, translate_text, calculate, send_email, read_file)
- Adds `record_tool_call_cassettes.sh` to record 8 cassettes covering all four tool_choice modes (auto, none, required, named) × streaming and non-streaming

The cassettes recorded in this PR help test #59.

## How to Record

**1. Start vLLM with tool-call support:**

```bash
vllm serve Qwen/Qwen3-30B-A3B-FP8 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --port 5050
```

**2. Run the cassette recorder:**

```bash
VLLM_URL=http://0.0.0.0:5050 MODEL=Qwen/Qwen3-30B-A3B-FP8 bash tests/cassettes/record_tool_call_cassettes.sh
```

---------

Signed-off-by: maral <maralbahari.98@gmail.com>
The ResponseAccumulator's process_event previously dropped all function_call SSE events via the wildcard arm, causing streaming responses to lose tool-call output items. This adds handlers for OutputItemAdded(function_call), FunctionCallArgumentsDelta, and FunctionCallArgumentsDone — matching the blocking JSON path's behavior and unblocking execute_loop for streaming tool dispatch. Signed-off-by: Ashwin Giridharan <girida@amazon.com>
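The three handlers named in this commit message can be sketched roughly as below. This is a minimal illustration and not the PR's code: the field layouts of `StreamEvent`, `FunctionToolCall`, and `OutputItem`, the `in_flight` vector, and the lookup by item id are all assumptions made for the example. Only the event names and the overall behavior (start on added, append on delta, finalize on done) come from the PR text.

```rust
// Hypothetical, simplified types. The real crate's definitions are not
// shown in this PR and will differ.
#[derive(Debug, Clone, PartialEq)]
pub struct FunctionToolCall {
    pub id: String,
    pub call_id: String,
    pub name: String,
    pub arguments: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum OutputItem {
    FunctionCall(FunctionToolCall),
}

pub enum StreamEvent {
    OutputItemAdded { item_type: String, id: String, call_id: String, name: String },
    FunctionCallArgumentsDelta { item_id: String, delta: String },
    FunctionCallArgumentsDone { item_id: String, arguments: String },
    Other,
}

#[derive(Default)]
pub struct ResponseAccumulator {
    // Calls that have been announced but not yet finalized (assumed layout).
    in_flight: Vec<FunctionToolCall>,
    pub output: Vec<OutputItem>,
}

impl ResponseAccumulator {
    pub fn process_event(&mut self, event: StreamEvent) {
        match event {
            // Start a new in-flight call with an empty argument buffer.
            StreamEvent::OutputItemAdded { item_type, id, call_id, name }
                if item_type == "function_call" =>
            {
                self.in_flight.push(FunctionToolCall {
                    id,
                    call_id,
                    name,
                    arguments: String::new(),
                });
            }
            // Append the partial JSON fragment to the matching call.
            StreamEvent::FunctionCallArgumentsDelta { item_id, delta } => {
                if let Some(call) = self.in_flight.iter_mut().find(|c| c.id == item_id) {
                    call.arguments.push_str(&delta);
                }
            }
            // Replace the buffer with the final arguments and emit the item.
            StreamEvent::FunctionCallArgumentsDone { item_id, arguments } => {
                if let Some(pos) = self.in_flight.iter().position(|c| c.id == item_id) {
                    let mut call = self.in_flight.remove(pos);
                    call.arguments = arguments;
                    self.output.push(OutputItem::FunctionCall(call));
                }
            }
            // Everything else still falls through the wildcard arm.
            _ => {}
        }
    }
}
```

In this sketch the `Done` event overwrites whatever the deltas accumulated, so a dropped or reordered delta cannot corrupt the final `arguments` string.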
…tion Feeds real vLLM SSE recordings (gemma-4-26B) through the full accumulator pipeline and verifies OutputItem::FunctionCall is produced with correct name, arguments, call_id, and usage. Also adds a text-only regression guard ensuring no function_call items leak from text-only streams. Signed-off-by: Ashwin Giridharan <girida@amazon.com>
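Replaying a recording through the accumulator first requires splitting the raw SSE text into events. The sketch below shows one way to do that with the standard library only. It assumes the cassette stores the stream as plain `event:` / `data:` lines separated by blank lines; the actual cassette format used by these tests is not shown in this PR, and `SseEvent` and `parse_sse` are hypothetical names.

```rust
/// One parsed server-sent event: the `event:` name (empty if absent)
/// and the `data:` lines joined with newlines.
#[derive(Debug, PartialEq)]
pub struct SseEvent {
    pub event: String,
    pub data: String,
}

/// Strip the single optional space that may follow the field colon.
fn field_value(rest: &str) -> &str {
    rest.strip_prefix(' ').unwrap_or(rest)
}

/// Split raw SSE text into events. Comment lines (starting with ':')
/// and unknown fields are ignored.
pub fn parse_sse(raw: &str) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut name = String::new();
    let mut data: Vec<&str> = Vec::new();

    for line in raw.lines() {
        if line.is_empty() {
            // A blank line terminates the current event.
            if !data.is_empty() {
                events.push(SseEvent {
                    event: std::mem::take(&mut name),
                    data: data.join("\n"),
                });
                data.clear();
            } else {
                name.clear();
            }
        } else if let Some(rest) = line.strip_prefix("event:") {
            name = field_value(rest).to_string();
        } else if let Some(rest) = line.strip_prefix("data:") {
            data.push(field_value(rest));
        }
    }

    // Leniency for recorded files: keep a final event even if the
    // trailing blank line is missing.
    if !data.is_empty() {
        events.push(SseEvent { event: name, data: data.join("\n") });
    }
    events
}
```

Each resulting `data` string would then be deserialized into the crate's event type and passed to the accumulator; that step depends on types not shown here.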
Signed-off-by: Ashwin Giridharan <girida@amazon.com>
…lator

Add 10 new cassette tests covering:
- All tool_choice modes streaming (auto, required, named, none)
- All tool_choice modes non-streaming (validates from_json path)
- Reasoning streaming (Qwen3 and GPT-oss)

Validates the accumulator correctly handles real multi-tool streaming responses and non-streaming JSON responses from multiple model families.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
35241da to 569a1f5
@maralbahari thanks! I rebased on top of #60 and added cassette tests that exercise all 4 `tool_choice` modes.

Also added reasoning cassette tests (Qwen3 + GPT-oss) as a regression guard, and found an interesting edge case with GPT-oss where it skips `response.output_item.added` for the message item after reasoning. 12 cassette tests total, all passing.
## Summary

The streaming path's `ResponseAccumulator` handled text message events but did not yet have handlers for `function_call` events; they fell through the wildcard arm and were dropped. This adds the missing accumulation so streaming responses include `FunctionCall` output items, matching the blocking JSON path.

Prerequisite for #51 (`execute_loop` needs `FunctionCall` items available in the streaming path to dispatch tools).

## Before
A client sends a streaming request with tools:

```json
{"model": "meta-llama/...", "stream": true, "input": "What's the weather?", "tools": [{"type": "function", "name": "get_weather", ...}]}
```

vLLM responds with `function_call` SSE events, but the accumulator ignores them. The client gets back:

```json
{"status": "completed", "output": []}
```

## After
The same request now produces:

```json
{
  "status": "completed",
  "output": [{
    "type": "function_call",
    "id": "fc_1",
    "call_id": "call_abc",
    "name": "get_weather",
    "arguments": "{\"location\":\"Paris\"}",
    "status": "completed"
  }]
}
```

## What's handled
- `OutputItemAdded` with `item_type == "function_call"`: starts a new in-flight `FunctionToolCall`
- `FunctionCallArgumentsDelta`: appends to argument buffer
- `FunctionCallArgumentsDone`: finalizes with authoritative arguments, pushes to output
- `match` with three-way type dispatch

## Test Plan
Unit tests (8 in `accumulator.rs`):

Cassette integration tests (12 in `accumulator_cassette_test.rs`):

- `tool_choice=auto` streaming + non-streaming (parallel tool calls)
- `tool_choice=required` streaming + non-streaming
- `tool_choice=named` streaming + non-streaming
- `tool_choice=none` streaming + non-streaming (zero function calls)

Non-streaming tests exercise the `from_json` path against the same cassettes.

All tests pass, `cargo fmt --check` + `cargo clippy -- -D warnings` clean.

## Note: GPT-oss streaming gap
While testing the reasoning cassettes, I found that GPT-oss doesn't emit `response.output_item.added` for the message item after reasoning; it jumps from `reasoning_text.done` → `output_text.done` → `output_item.done`. The streaming accumulator can't capture that message because no `output_item.added` creates the in-flight state. Qwen3 emits the event correctly. Not blocking: the full output is in the `response.completed` payload. Filed as #62 for follow-up.
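To make the gap concrete, here is a small sketch of an accumulator that only records text for items it has seen an "added" event for, together with one possible fallback that trusts the `response.completed` payload when it carries more items. Everything here is hypothetical: `Item`, `Event`, `Accumulator`, and the "prefer the longer list" rule are assumptions for illustration, not this PR's code and not necessarily what the follow-up in #62 will do.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Item {
    Reasoning(String),
    Message(String),
}

// Simplified stand-ins for the SSE events discussed in the note.
pub enum Event {
    /// `response.output_item.added`: opens in-flight state for `index`.
    ItemAdded { index: usize, is_message: bool },
    /// A `*.done` text event carrying the final text for `index`.
    TextDone { index: usize, text: String },
    /// `response.completed`, carrying the full output list.
    Completed { output: Vec<Item> },
}

#[derive(Default)]
pub struct Accumulator {
    // index -> whether the in-flight item is a message (assumed layout)
    in_flight: HashMap<usize, bool>,
    pub output: Vec<Item>,
}

impl Accumulator {
    pub fn process(&mut self, event: Event) {
        match event {
            Event::ItemAdded { index, is_message } => {
                self.in_flight.insert(index, is_message);
            }
            Event::TextDone { index, text } => {
                // With no prior ItemAdded there is no in-flight state,
                // so the text is silently dropped. This is the gap.
                if let Some(is_message) = self.in_flight.remove(&index) {
                    self.output.push(if is_message {
                        Item::Message(text)
                    } else {
                        Item::Reasoning(text)
                    });
                }
            }
            Event::Completed { output } => {
                // Assumed fallback: prefer the final payload when it
                // contains items the stream did not deliver.
                if output.len() > self.output.len() {
                    self.output = output;
                }
            }
        }
    }
}
```

Feeding this sketch a reasoning item followed by a message `TextDone` with no matching `ItemAdded` leaves one item in `output` until the `Completed` event restores the missing message.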