Add support for raw ids in prompts in vLLM client and server (#5225)
qgallouedec merged 10 commits into main.

Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d2bb6727b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@codex review
💡 Codex Review
Reviewed commit: 3ea2fcff50
Title changed: "prompt_token_ids support to vLLM client and server" → "prompts in vLLM client and server"
albertvillanova left a comment:
Thanks. Just some minor comments below.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).
When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via apply_chat_template, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call, never decoding and re-tokenizing. Related PRs in the series: rollout_func from _generate_single_turn to _generate (#5232), _generate_single_turn (#5239), _generate_single_turn (#5240).
To fix that, we need the ability to pass pre-tokenized prompts directly through the client/server pipeline. This PR adds that capability without changing any existing behavior.
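The merge ambiguity above can be shown with a toy greedy tokenizer (purely illustrative — VOCAB, encode, and decode are invented for this sketch, not TRL or vLLM code): decoding IDs to text and re-encoding can merge tokens across the old boundary, so the IDs change.

```python
# Toy BPE-like tokenizer: greedy longest-match encoding, so a token
# boundary present in the original IDs can be absorbed on re-encode.
VOCAB = {"foo": 1, " bar": 2, "foo bar": 3}
ID_TO_TEXT = {v: k for k, v in VOCAB.items()}

def decode(ids):
    return "".join(ID_TO_TEXT[i] for i in ids)

def encode(text):
    ids = []
    while text:
        # prefer the longest matching vocab entry, like a BPE merge
        tok = max((t for t in VOCAB if text.startswith(t)), key=len)
        ids.append(VOCAB[tok])
        text = text[len(tok):]
    return ids

original = [1, 2]                      # model emitted "foo" + " bar"
round_trip = encode(decode(original))  # decode -> "foo bar" -> re-encode
print(original, round_trip)            # [1, 2] vs [3]: the IDs changed
```

This drift is why the pipeline must carry raw token IDs end to end instead of round-tripping through text.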
Changes
- VLLMClient.generate(): add support for the prompts parameter being token IDs. Existing callers passing strings in prompts are unaffected.
- vllm_serve.py: GenerateRequest now accepts prompts as either strings or token IDs.
- Add test_generate_with_token_ids across all test classes to cover the new code path.
Backward compatibility
Fully backward compatible.
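The string-vs-token branching the server needs can be sketched as a small dispatcher (hypothetical helper — split_prompts is not the PR's actual function, and the real request schema may differ):

```python
def split_prompts(prompts):
    """Route a homogeneous batch of prompts to the right vLLM input:
    plain strings go to the text path, lists of ints to prompt_token_ids."""
    if all(isinstance(p, str) for p in prompts):
        return {"prompts": prompts}
    if all(isinstance(p, list) and all(isinstance(t, int) for t in p)
           for p in prompts):
        # token-ID path: per the PR description, image support is
        # disabled here, since there is no text to pair images with
        return {"prompt_token_ids": prompts}
    raise ValueError("prompts must be all strings or all token-ID lists")

print(split_prompts([[101, 7592], [101, 2088]]))
```

Rejecting mixed batches outright keeps the contract simple: a caller can never silently get half a batch re-tokenized.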
Tests
Note
Medium Risk
Changes the request/dispatch logic for the /generate/ endpoint and the server-mode generation path, so regressions could surface in prompt handling (especially around the string-vs-token branching and multimodal/image prompts).

Overview
Adds a new token-in path for vLLM generation by allowing VLLMClient.generate() and the /generate/ API (vllm_serve.py) to accept prompts as list[list[int]] in addition to strings; the server now detects token IDs and forwards them to vLLM via prompt_token_ids (disabling image support for that path).
Updates server-mode VLLMGeneration.generate() to pre-tokenize non-chat prompts with processing_class and call vllm_client.generate() with token IDs, and adds test_generate_with_token_ids coverage across the vLLM client/server test variants.
Written by Cursor Bugbot for commit f033e63. This will update automatically on new commits.
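The end-to-end flow the overview describes — tokenize once, then stay in token space — reduces to this shape (stand-in functions only: a whitespace "tokenizer" and an echo "generator" replace processing_class and vllm_client.generate()):

```python
def tokenize(text, vocab):
    # stand-in for processing_class: whitespace split with a growing vocab
    return [vocab.setdefault(w, len(vocab)) for w in text.split()]

def generate(prompt_token_ids):
    # stand-in for vllm_client.generate(): token-in / token-out,
    # appending "completion" IDs without ever decoding to text
    return prompt_token_ids + [999]

vocab = {}
ids = tokenize("solve the task", vocab)  # tokenize exactly once
out = generate(ids)                      # raw IDs pass straight through
assert out[:len(ids)] == ids             # no re-tokenization drift possible
```

Because the prompt IDs are forwarded verbatim, the BPE-merge hazard described in the Context section cannot occur anywhere downstream of the first tokenization.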