[GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` #5239
qgallouedec merged 58 commits into main
Conversation
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ab04b05d6
trl/trainer/grpo_trainer.py (Outdated)

```python
generate_inputs = {"input_ids": padded_ids, "attention_mask": attention_mask}
# For VLMs, include multimodal fields as tensors (pixel_values, image_grid_thw, etc.)
for k, v in multimodal_fields.items():
    generate_inputs[k] = torch.tensor(np.array(v)) if not isinstance(v, torch.Tensor) else v
Pad sequence multimodal fields before tensor conversion
The regular-generation path now converts every multimodal_fields entry with torch.tensor(np.array(v)), but apply_chat_template(..., tokenize=True, return_dict=True) can return per-token lists (for example token_type_ids/mm_token_type_ids) whose lengths differ across prompts in a batch. In that case np.array(v) is ragged and raises ValueError, so generation fails before model.generate; previously this was avoided because return_tensors="pt", padding=True produced padded tensors automatically. This needs sequence-aware padding for those fields rather than direct array conversion.
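The sequence-aware padding the reviewer asks for could look roughly like the sketch below. It is a minimal illustration, not the PR's code: the helper name `pad_ragged_fields` and its signature are invented here, and it assumes ragged fields are plain Python lists of per-token lists (e.g. `token_type_ids`), while already-rectangular fields and tensors pass through unchanged.

```python
import torch

def pad_ragged_fields(multimodal_fields, pad_value=0, padding_side="left"):
    """Pad ragged per-token list fields to a common length before tensor
    conversion; leave tensors and rectangular fields untouched."""
    padded = {}
    for k, seqs in multimodal_fields.items():
        if isinstance(seqs, torch.Tensor):
            padded[k] = seqs
            continue
        is_ragged = (
            seqs
            and isinstance(seqs[0], (list, tuple))
            and len({len(s) for s in seqs}) > 1
        )
        if is_ragged:
            max_len = max(len(s) for s in seqs)
            rows = []
            for s in seqs:
                fill = [pad_value] * (max_len - len(s))
                # left-pad to match decoder-only generation conventions
                rows.append(fill + list(s) if padding_side == "left" else list(s) + fill)
            padded[k] = torch.tensor(rows)
        else:
            padded[k] = torch.tensor(seqs)
    return padded
```

With this in place, `generate_inputs[k] = pad_ragged_fields(...)[k]` would avoid the `ValueError` that `np.array(v)` raises on ragged input.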
albertvillanova left a comment
Thanks. Just a question below.
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).
When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via `apply_chat_template`, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call, never decoding and re-tokenizing.

Related PRs in the series:
- … `prompts` in vLLM client and server #5225
- … `rollout_func` from `_generate_single_turn` to `_generate` #5232
- … `_generate_single_turn` #5240

The previous PR moved tokenization out of
`VLLMGeneration.generate` so that vLLM always receives raw token IDs. However, the paged and regular generation backends still performed their own independent tokenization. This PR unifies tokenization across all three backends (vLLM, paged, regular) by tokenizing once at the top of `_generate_single_turn`.

Changes
- `_generate_single_turn()`: Move tokenization to the top of the method, before the backend dispatch. For conversational prompts, use `apply_chat_template(tokenize=True, return_dict=True)` and extract images + multimodal fields. For text prompts, use `processing_class(text=prompts)`. All three backends (vLLM, paged, regular) now consume the same `prompt_ids` list. Remove the per-backend tokenization that previously existed in the paged and regular paths.
- `_generate_single_turn()`: Same unification applied.
- Left-pad `prompt_ids` using `pad()`. Convert multimodal fields (numpy arrays from `apply_chat_template`) to tensors before passing to `model.generate()`.

Why
Previously, each generation backend tokenized prompts independently:
- `_generate_single_turn` (added in PR4)
- `apply_chat_template` or `processing_class` internally
- `apply_chat_template(return_tensors="pt", padding=True)` or `processing_class(return_tensors="pt")` (we now convert to tensors manually)

This allows the next PR to move tokenization outside `_generate_single_turn` and into the trainers' `generate` method.

Backward compatibility
No user-facing API changes. The only behavioral difference is that the regular and paged paths now use the same tokenization call as the vLLM path, which should produce identical results.
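The BPE merge ambiguity motivating this series can be shown with a toy greedy longest-match tokenizer (the vocabulary below is invented for illustration and is not TRL's or any real model's): decoding token IDs to text and re-encoding can merge adjacent tokens, producing different IDs than the model actually generated.

```python
# Toy tokenizer: greedy longest-match encoding over a tiny vocabulary.
vocab = {"ab": 0, "a": 1, "b": 2}
inv = {v: k for k, v in vocab.items()}

def encode(text):
    ids, i = [], 0
    while i < len(text):
        for tok in sorted(vocab, key=len, reverse=True):  # longest match first
            if text.startswith(tok, i):
                ids.append(vocab[tok])
                i += len(tok)
                break
    return ids

def decode(ids):
    return "".join(inv[i] for i in ids)

generated = [1, 2]                 # the model emitted "a" then "b" as separate tokens
text = decode(generated)           # -> "ab"
assert encode(text) == [0]         # re-encoding merges them into one token
assert encode(text) != generated   # token IDs no longer match what the model produced
```

A token-in / token-out pipeline sidesteps this entirely by never round-tripping through text.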
Note
Medium Risk
Changes core generation input preparation across three backends and introduces new tensor conversion/padding behavior (including multimodal fields), which could subtly alter generation outputs or break on processor edge cases.
Overview
Unifies `_generate_single_turn` in `GRPOTrainer` and `RLOOTrainer` to tokenize prompts once and feed the same `prompt_ids` into all generation backends (vLLM, paged `generate_batch`, and standard `model.generate`), removing backend-specific tokenization.

For conversational/VLM inputs, it now uses `apply_chat_template(..., padding=True)`, then unpads via `attention_mask`, carries through multimodal fields (e.g. `pixel_values`, `image_grid_thw`), and in the standard generation path manually left-pads `prompt_ids` and converts multimodal outputs to tensors (adding `numpy` as a dependency) before calling `generate`.

Written by Cursor Bugbot for commit c098b90.
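The manual left-padding of `prompt_ids` mentioned in the overview can be sketched as follows. This is a simplified stand-in, not the trainer's actual `pad()` helper: the function name `left_pad_prompt_ids` is invented here, and it assumes `prompt_ids` is a list of variable-length token-ID lists destined for a decoder-only `model.generate` call.

```python
import torch

def left_pad_prompt_ids(prompt_ids, pad_token_id):
    """Left-pad variable-length token-ID lists into a batch tensor plus an
    attention mask (0 over padding, 1 over real tokens)."""
    max_len = max(len(ids) for ids in prompt_ids)
    padded, mask = [], []
    for ids in prompt_ids:
        n_pad = max_len - len(ids)
        padded.append([pad_token_id] * n_pad + list(ids))
        mask.append([0] * n_pad + [1] * len(ids))
    return torch.tensor(padded), torch.tensor(mask)

ids, attn = left_pad_prompt_ids([[5, 6, 7], [8]], pad_token_id=0)
# ids:  [[5, 6, 7], [0, 0, 8]]
# attn: [[1, 1, 1], [0, 0, 1]]
```

Left-padding (rather than right-padding) keeps every prompt's last real token adjacent to the generated continuation, which is why decoder-only generation conventionally pads on the left.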