
Multi-turn tool call chat completions conversations#712

Open
jaredoconnell wants to merge 9 commits into vllm-project:main from
jaredoconnell:feat/multi-turn-tools-chat

Conversation

@jaredoconnell
Collaborator

Summary

This PR adds client-side chat completions conversations to the http backend.

Details

  • The design is data-driven. The only data passed to the backend is an API field not included in datasets (required vs auto) and the behavior for handling missing tool calls.
  • Allows external datasets or synthetic data. For synthetic data, it's a simple JSON result with a field populated by the same generation logic as any other synthetic data in GuideLLM.
  • Allows specifying tool calls as auto or required to the model, which is useful for testing various scenarios. Models behave differently depending on the value set; required is best for predictability.
  • Allows specifying how to handle missing tool calls: whether a missing call is acceptable or an error condition, and, if acceptable, whether to end the conversation early or continue.
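
For illustration, a multi-turn dataset row with per-turn tool columns might look roughly like this (field names are hypothetical sketches, not the exact schema used by GuideLLM):

```json
{
  "prompt_0": "What is the weather in Boston?",
  "tools_0": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}
      }
    }
  ],
  "prompt_1": "Thanks, and in Denver?"
}
```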

Test Plan

  • Run the tests
  • Follow the documentation to run vLLM with tool calls enabled

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Adds variable size responses for synthetic data, and better handles edge cases for external datasets.
Also improves documentation.

Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Extracts functionality to new static methods.

Assisted-by: Claude Code Sonnet 4.5
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Collaborator

@sjmonson sjmonson left a comment


Not finished reviewing; will add more comments in a bit. GitHub is acting up and not letting me add to this review for some reason.

Comment thread src/guidellm/backends/openai/http.py Outdated
Comment thread src/guidellm/benchmark/outputs/html.py Outdated
Comment on lines +164 to +169
     # Preserve the original turn index from the column name so
     # that sparse columns (e.g. tools_0, tools_3) stay aligned
     # with the turns they belong to.
     for original_turn, column_name in sorted(turn_columns):
         column_type = cast("GenerativeDatasetColumnType", column_type)
-        mappings[(column_type, turn)].append((index, column_name))
+        mappings[(column_type, original_turn)].append((index, column_name))
Collaborator


I'm fine with this change since I debated this behavior initially, but remove the comment and rename the var back to turn since original_turn is meaningless to someone who is not looking at the original code.

Comment thread src/guidellm/schemas/info.py Outdated
Comment on lines +171 to +179
stop_conversation: bool = Field(
    default=False,
    description=(
        "When True, the worker cancels all remaining turns in the "
        "conversation after this request completes successfully. "
        "Set by the backend when a turn's result means the conversation "
        "should not continue (e.g. expected tool call not produced)."
    ),
)
Collaborator


We already use exceptions to trigger early stops from the backend, so I prefer that route. "error_stop" can throw any exception, and "ignore_stop" might work by throwing asyncio.CancelledError (see a later comment for examples).
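
As a rough sketch of that route (the class and function names here are hypothetical, not GuideLLM's actual ones), the three missing-tool-call behaviors could map onto exceptions like this:

```python
import asyncio


class ConversationStopError(Exception):
    """Raised by the backend when a turn's result should end the conversation."""


def handle_missing_tool_call(behavior: str) -> None:
    # Map each configured behavior to a control-flow outcome.
    if behavior == "error_stop":
        # Any exception marks the turn as an error and cancels remaining turns.
        raise ConversationStopError("expected tool call not produced")
    if behavior == "ignore_stop":
        # CancelledError unwinds the remaining turns without marking an error.
        raise asyncio.CancelledError()
    # "ignore_continue": fall through and proceed to the next turn.
```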

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need these changes with above suggestion.

Comment thread src/guidellm/backends/openai/request_handlers.py Outdated
Comment thread src/guidellm/backends/openai/http.py Outdated
Comment thread src/guidellm/backends/openai/http.py Outdated
Comment thread src/guidellm/backends/openai/request_handlers.py Outdated
Comment on lines +583 to +594
# Tool calling requires the model to stop naturally after producing
# valid JSON; ignore_eos would force generation past that point and
# break the server's constrained decoding grammar.
body.pop("ignore_eos", None)
body.pop("stop", None)

# On tool-call turns, let the model finish valid JSON naturally;
# max_completion_tokens would truncate output mid-JSON and corrupt
# the arguments sent in conversation history on follow-up turns.
if data.expects_tool_call:
    body.pop("max_completion_tokens", None)
    body.pop("max_tokens", None)
Collaborator


This does not seem right. Either all of these key pops belong under if data.expects_tool_call, or none of them do. If you pop ignore_eos but not max_tokens, you end up with the same problem mentioned in the first comment.

Collaborator


It's also a very bad idea to remove these keys unless we absolutely have to, because it could destroy benchmark reproducibility. So if you haven't already, I would suggest you double-check the behavior.
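
One way to follow the suggestion above while keeping ordinary turns reproducible (a sketch; body and expects_tool_call stand in for the handler's real state) would be to gate every pop on the tool-call turn:

```python
def strip_generation_limits(body: dict, expects_tool_call: bool) -> dict:
    """Remove keys that would break constrained decoding, but only on
    tool-call turns, so non-tool turns keep their exact sampling
    parameters and benchmark reproducibility is preserved."""
    if expects_tool_call:
        for key in ("ignore_eos", "stop", "max_completion_tokens", "max_tokens"):
            body.pop(key, None)
    return body
```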

Comment thread src/guidellm/backends/openai/request_handlers.py Outdated
Comment thread src/guidellm/backends/openai/request_handlers.py Outdated
Comment thread src/guidellm/backends/openai/request_handlers.py Outdated
Comment thread src/guidellm/schemas/response.py Outdated
Comment thread src/guidellm/schemas/response.py Outdated
Comment thread src/guidellm/__main__.py Outdated
Comment on lines +411 to +433
# Tool calling configuration
@click.option(
    "--tool-choice",
    type=str,
    default=None,
    help=(
        'Tool choice mode: "required", "auto", or "none". '
        "Controls whether the model is forced to produce tool calls on "
        "tool-call turns. Overrides the per-request default (required) set "
        "when tools come from the dataset."
    ),
)
@click.option(
    "--tool-call-missing-behavior",
    type=click.Choice(["ignore_continue", "ignore_stop", "error_stop"]),
    default=None,
    help=(
        "What the worker does when a tool call is expected but the model "
        "does not produce one. ignore_continue: continue to next turn, "
        "ignore_stop: cancel remaining turns, error_stop: error and "
        "cancel remaining turns. Default: error_stop."
    ),
)
Collaborator


No new top-level args. Handle this in --backend-kwargs.
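
A hedged sketch of what that might look like on the command line (the JSON keys mirror the option names above and are illustrative, not a confirmed schema):

```bash
# Hypothetical: route tool-call options through --backend-kwargs as JSON
# rather than dedicated top-level flags.
guidellm benchmark \
  --target http://localhost:8000 \
  --backend-kwargs '{"tool_choice": "required", "tool_call_missing_behavior": "error_stop"}'
```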

Comment thread src/guidellm/__main__.py Outdated
Comment thread docs/guides/multiturn.md
Collaborator


This should be its own guide, not inserted into the generic Multiturn one.

Collaborator Author


Moved.

These are the diffs recommended in the comments. They are untested, and require some follow-up changes.

Co-authored-by: Samuel Monson <smonson@irbash.net>
Signed-off-by: Jared O'Connell <46976761+jaredoconnell@users.noreply.github.com>
@mergify
Contributor

mergify Bot commented May 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jaredoconnell.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 5, 2026
Moves documentation. Switches fully to exceptions to stop conversations early.
Also includes info gained from vLLM contributor.

Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <joconnel@redhat.com>
Comment thread docs/guides/multiturn.md
Collaborator Author


Moved.

]

# Short placeholder used when no tool_response_tokens size is configured.
DEFAULT_SYNTHETIC_TOOL_RESPONSE = '{"status": "ok"}'
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 416 to 459
@@ -434,6 +449,12 @@ async def _process_next_request( # noqa: C901
             logger.opt(exception=True).debug(
                 f"Backend exception for request {request_info.request_id}"
             )
+            # Cancel remaining conversation turns on backend error
+            for skip_req, skip_info in conversation:
+                skip_info.error = f"Cancelled: {request_info.error}"
+                skip_info.timings.resolve_end = time.time()
+                self._send_update("cancelled", None, skip_req, skip_info)
+            conversation.clear()
         finally:
             if request_info is not None:
Collaborator Author


What do you think of this design?


Development

Successfully merging this pull request may close these issues.

Support benchmarking of reasoning and tool calling Chat Completions requests

2 participants