[GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs#5242
qgallouedec wants to merge 101 commits into main

Context
Part of the series to fix the re-tokenization bug in GRPO multi-turn tool calling (see #5224).
closes #5224
closes #5144
When the model generates a completion in a tool-calling loop, the decoded text is re-tokenized via `apply_chat_template`, which can produce different token IDs due to BPE merge ambiguities. To fix this, we need a token-in / token-out pipeline: tokenize once, then pass raw token IDs through every subsequent generation call, never decoding and re-tokenizing.

Earlier PRs in the series:

- `prompts` in vLLM client and server (#5225)
- `rollout_func` from `_generate_single_turn` to `_generate` (#5232)
- `_generate_single_turn` (#5239)
- `_generate_single_turn` (#5240)

This is the final PR in the series. It eliminates the re-tokenization in the tool-calling loop, the actual source of the bug.
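To see why decode-then-re-tokenize is unsafe, here is a toy, self-contained illustration (not TRL's actual tokenizer): with a greedy longest-match vocabulary, re-tokenizing the concatenated text can merge across the segment boundary and change the token sequence, while direct token concatenation cannot.

```python
# Toy illustration of the re-tokenization bug (NOT TRL's tokenizer):
# greedy longest-match tokenization is not compositional, so
# decode-then-re-tokenize can change tokens at segment boundaries.
VOCAB = sorted(["ab", "bc", "a", "b", "c"], key=len, reverse=True)

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenizer over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for tok in VOCAB:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable text at position {i}")
    return tokens

completion = tokenize("a")    # model output, kept as tokens: ["a"]
tool_result = tokenize("bc")  # freshly tokenized tool text: ["bc"]

# Token-in / token-out: concatenate the token sequences directly.
concat = completion + tool_result        # ["a", "bc"]

# Buggy path: decode to text, then re-tokenize the whole string.
retokenized = tokenize("".join(concat))  # ["ab", "c"]

assert concat != retokenized  # the merge crossed the segment boundary
```

Real BPE tokenizers exhibit the same non-compositionality around merge boundaries, which is why token IDs must be preserved rather than round-tripped through text.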
Changes
- New `_get_tool_suffix_ids(tool_messages)` method: tokenizes only the tool result portion by diffing a minimal dummy conversation (2 messages vs 3 messages). This avoids re-tokenizing the full conversation history.
- `_tool_call_loop`: instead of re-tokenizing `prompt + completion + tool_results` via `apply_chat_template`, builds the token sequence by concatenation: `prompt_ids + completion_ids + tool_suffix_ids`. The original prompt and completion token IDs are preserved exactly as they were; only the new tool result tokens are freshly tokenized.
- Removes the `_tokenize_prompts` call in the tool loop.

The bug and the fix
Previously, after a tool call:
- `prompt + assistant + tool_results` was re-tokenized via `apply_chat_template`

Now:

- `prompt_ids` and `completion_ids` are kept as-is (never decoded and re-tokenized)
- the full sequence is built as `prompt_ids + completion_ids + suffix_ids`

Backward compatibility
No user-facing API changes.
`_get_tool_suffix_ids` and `_tool_call_loop` are internal methods.

Note
Medium Risk
Touches GRPO/RLOO generation paths and tool-calling control flow; mistakes could change generated token sequences or break tool-loop behavior, impacting training stability.
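As a rough sketch of the control flow at stake (all names here are illustrative stand-ins, not TRL's actual API; the real logic lives in `_tool_call_loop`), the fixed loop keeps token IDs end to end:

```python
# Hedged sketch of a token-in / token-out tool-calling loop. The
# callables (generate, parse_tool_calls, run_tools, get_tool_suffix_ids)
# are hypothetical stand-ins for the trainer's internals.
def tool_call_loop(generate, parse_tool_calls, run_tools,
                   get_tool_suffix_ids, prompt_ids, max_turns=4):
    sequence = list(prompt_ids)  # prompt token IDs, tokenized exactly once
    for _ in range(max_turns):
        completion_ids = generate(sequence)  # token IDs in, token IDs out
        sequence += completion_ids           # kept exactly as generated
        tool_messages = run_tools(parse_tool_calls(completion_ids))
        if not tool_messages:
            break  # no tool call: the loop ends
        # Only the new tool-result text is tokenized; nothing already in
        # `sequence` is ever decoded and re-tokenized.
        sequence += get_tool_suffix_ids(tool_messages)
    return sequence
```

Because `sequence` only ever grows by appending token IDs, the completion tokens used for training are bit-for-bit the ones the model generated.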
Overview
Fixes GRPO multi-turn tool calling to be token-in/token-out. The tool loop no longer decodes and re-tokenizes `prompt + completion + tool_results`; it now concatenates existing `prompt_ids`/`completion_ids` with freshly tokenized tool-result suffix IDs to preserve the exact original completion tokens.

Adds internal helper `_get_tool_suffix_ids()` to tokenize only the tool-result formatting, updates `_tool_call_loop()` to carry `images`/`multimodal_fields` through regeneration without re-running `_tokenize_prompts`, and simplifies `_generate_single_turn()` return values (dropping redundant `prompt_ids` returns) in both `grpo_trainer.py` and `rloo_trainer.py`.

Written by Cursor Bugbot for commit 10708ca.
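The suffix-diffing idea the overview describes can be sketched as follows, using a stand-in tokenizer with a trivial chat template (the dummy messages, the template, and the helper name are illustrative assumptions, not TRL's implementation):

```python
# Sketch of isolating tool-result tokens by diffing two renderings of a
# minimal dummy conversation (illustrative, not TRL's implementation).
class ToyTokenizer:
    """Stand-in chat tokenizer: renders messages, one token ID per char."""
    def apply_chat_template(self, messages, tokenize=True):
        text = "".join(f"<{m['role']}>{m['content']}</{m['role']}>"
                       for m in messages)
        return [ord(ch) for ch in text] if tokenize else text

def get_tool_suffix_ids(tokenizer, tool_messages):
    # Minimal dummy conversation standing in for prompt + assistant turn.
    base = [{"role": "user", "content": "x"},
            {"role": "assistant", "content": "y"}]
    base_ids = tokenizer.apply_chat_template(base)
    full_ids = tokenizer.apply_chat_template(base + tool_messages)
    # Under a template where the base render is a strict prefix of the
    # full render, the token-ID diff is exactly the tool-result suffix.
    assert full_ids[:len(base_ids)] == base_ids
    return full_ids[len(base_ids):]

suffix = get_tool_suffix_ids(ToyTokenizer(),
                             [{"role": "tool", "content": "42"}])
# Under this toy template, suffix decodes to "<tool>42</tool>".
```

The key property being relied on is that adding a trailing message only appends tokens under the chat template, so a fixed 2-message dummy prefix can be subtracted off without touching the real conversation history.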