[GRPO/RLOO] Tokenize before vLLM generation call #5238
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 09128d6711
albertvillanova left a comment
Thanks. Some suggestions below.
Regarding tests (I know this is always a non-trivial issue): the old `prepare_multimodal_messages_vllm` path in colocate mode (calling `llm.chat()`) is replaced by the `multi_modal_data: {"image": ...}` dict path with `llm.generate()`. Do you think it would be worth adding a unit test for this path (maybe mocking `llm`)?
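A test along the lines suggested could mock `llm` and assert on the exact prompt dicts that reach `llm.generate()`. The sketch below is hypothetical: `colocate_generate` is a stand-in for the new colocate path, not the actual trainer code, and the dict shapes follow the description in this PR.

```python
from unittest.mock import MagicMock

def colocate_generate(llm, prompt_token_ids, images=None):
    # Hypothetical helper mirroring the new colocate path: build vLLM
    # prompt dicts from pre-tokenized IDs and optional per-prompt image
    # lists, then forward them to llm.generate() (no llm.chat() involved).
    prompts = []
    for i, ids in enumerate(prompt_token_ids):
        entry = {"prompt_token_ids": ids}
        if images is not None and images[i]:
            entry["multi_modal_data"] = {"image": images[i]}
        prompts.append(entry)
    return llm.generate(prompts)

# Unit test with a mocked llm: inspect what was sent to generate().
llm = MagicMock()
colocate_generate(llm, [[1, 2, 3], [4, 5]], images=[["<pil image>"], []])
(sent,), _ = llm.generate.call_args
assert sent[0] == {"prompt_token_ids": [1, 2, 3],
                   "multi_modal_data": {"image": ["<pil image>"]}}
assert sent[1] == {"prompt_token_ids": [4, 5]}  # text-only prompt: no multi_modal_data
```

This keeps the test independent of a real vLLM install, since only the dict-building contract is exercised.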
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224). When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via `apply_chat_template`, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call, never decoding and re-tokenizing. Related PRs in the series: `prompts` in vLLM client and server (#5225), `rollout_func` from `_generate_single_turn` to `_generate` (#5232), `_generate_single_turn` (#5239, #5240).

This PR moves tokenization out of `VLLMGeneration.generate` and into the trainers' `_generate_single_turn`, so that vLLM always receives raw token IDs instead of text or chat messages.

Changes

- `VLLMGeneration.generate()`: Replace the `prompts` (text/messages) parameter with `prompts` (token ID lists) + `images` (optional PIL image lists). Remove the internal `chat()` / text-tokenization paths; the method now always forwards pre-tokenized IDs to vLLM. For colocate mode, build `{"prompt_token_ids": ids, "multi_modal_data": ...}` dicts. Remove unused imports (`json`, `is_conversational`, `prepare_multimodal_messages_vllm`).
- `_generate_single_turn()` (GRPO): Tokenize prompts before calling vLLM. For conversational prompts, use `apply_chat_template(tokenize=True)` and extract PIL images from message content blocks. For text prompts, use `processing_class(text=prompts)`. Pass the resulting token IDs and images to `vllm_generation.generate()`.
- `_generate_single_turn()` (RLOO): Same tokenization pattern applied.

Why

Previously, `VLLMGeneration.generate` received raw text or chat messages and was responsible for tokenization (server mode) or delegating it to vLLM's `chat()` (colocate mode). This made it impossible to guarantee token-level consistency between the trainer's tokenization and vLLM's, which is a prerequisite for the token-in/token-out pipeline. By tokenizing once in `_generate_single_turn` and passing only token IDs downstream, we ensure that tokenization happens exactly once, in the trainer, and that images reach vLLM through its `multi_modal_data` mechanism.

Backward compatibility

The `VLLMGeneration.generate` signature changes from accepting text/messages to accepting token ID lists. This is an internal API; all call sites (GRPO and RLOO trainers) are updated in this PR. No user-facing API changes.

Note
Medium Risk
Touches core generation paths for GRPO/RLOO and changes vLLM input semantics, so regressions would affect sampling behavior and distributed generation (especially multimodal/image batches).
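The sampling-behavior risk traces back to the BPE merge ambiguity described in the Context section. A minimal toy sketch (hypothetical three-token vocabulary, not a real tokenizer) shows how decode-then-re-encode can silently change token IDs:

```python
# Toy BPE-style vocabulary: "ab" exists both as one merged token (id 2)
# and as the pair "a" (id 0) + "b" (id 1).
vocab = {0: "a", 1: "b", 2: "ab"}

def decode(ids):
    return "".join(vocab[i] for i in ids)

def encode_greedy(text):
    # Greedy longest-match, the way merges resolve after BPE training.
    ids, i = [], 0
    while i < len(text):
        if text[i : i + 2] == "ab":
            ids.append(2)
            i += 2
        else:
            ids.append({"a": 0, "b": 1}[text[i]])
            i += 1
    return ids

generated = [0, 1]                            # model emitted "a", then "b"
roundtrip = encode_greedy(decode(generated))  # re-tokenization picks the merge
assert roundtrip == [2]
assert roundtrip != generated                 # token IDs changed, text did not
```

Passing `generated` straight through as `prompt_token_ids`, as this PR does, avoids the round trip entirely.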
Overview
Implements a token-in/token-out vLLM generation path for `GRPOTrainer` and `RLOOTrainer` by moving all prompt tokenization out of `VLLMGeneration.generate()` and into the trainers' `_generate_single_turn()`. `VLLMGeneration.generate()` now always receives pre-tokenized `prompt_token_ids` (plus optional per-prompt image lists for VLMs), removes the chat/template/tool-handling branches and JSON tool-arg coercion, and adds distributed-safe gathering of `images` to avoid collectives deadlocking when some ranks have no images. Colocate mode now builds vLLM prompt dicts (`prompt_token_ids` + `multi_modal_data`) before calling `llm.generate()`.
Written by Cursor Bugbot for commit fee553d.
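The trainer-side step of extracting PIL images from conversational message content blocks (so they can accompany the token IDs into `vllm_generation.generate()`) might look like the following. This is a hedged sketch: `extract_images` is a hypothetical helper, and the content-block schema assumed here is the common `{"type": "image", "image": ...}` chat-template convention, not code quoted from this PR.

```python
def extract_images(messages):
    # Walk chat messages; multimodal messages carry a list of content
    # blocks, and image blocks hold the PIL image object to forward
    # alongside the prompt token IDs.
    images = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # text-only messages use a plain string
            for block in content:
                if block.get("type") == "image":
                    images.append(block["image"])
    return images

# Usage sketch with a placeholder standing in for a PIL.Image object:
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "<PIL.Image>"},
        {"type": "text", "text": "Describe the image."},
    ]},
]
assert extract_images(messages) == ["<PIL.Image>"]
```

The trainer would then pass these images and the `apply_chat_template(tokenize=True)` output to `vllm_generation.generate()`, keeping text-only prompts on the same token-ID path with an empty image list.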