Prevent corruption of DPO VLM training if "keep_end" truncation_mode#5286

Open
albertvillanova wants to merge 6 commits into huggingface:main from albertvillanova:fix-5285
Conversation

@albertvillanova
Member

@albertvillanova albertvillanova commented Mar 13, 2026

Prevent corruption of DPO VLM training if "keep_end" truncation_mode:

  • Raise ValueError when truncation_mode="keep_end" is used for VLM training in DPO.

Fix #5285.

This PR addresses a regression related to vision-language models (VLMs) and sequence truncation. It ensures that using the keep_end truncation mode with VLMs raises a clear error at initialization, preventing silent corruption of training data. The update includes both a code fix and a regression test.

Changes

Validation improvements for vision-language models:

  • Added a check in the DPOTrainer.__init__ method to raise a ValueError if a vision-language dataset is used with truncation_mode='keep_end', explaining that image tokens would be dropped and recommending alternatives.
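The guard itself is small. A minimal sketch of the shape of such a check, with illustrative names (`validate_truncation_config`, `is_vision_dataset`) that are assumptions here, not TRL's exact code:

```python
# Sketch of the init-time guard described above (names are illustrative,
# not copied from trl.DPOTrainer).
def validate_truncation_config(is_vision_dataset, max_length, truncation_mode):
    """Fail fast when keep_end truncation could drop image tokens from a VLM example."""
    if is_vision_dataset and max_length is not None and truncation_mode == "keep_end":
        raise ValueError(
            "truncation_mode='keep_end' is not supported for vision-language models. "
            "Image tokens reside inside the prompt portion of the sequence; keep_end may "
            "silently drop them. Use truncation_mode='keep_start' (the default) or set "
            "max_length=None."
        )
```

Note that the guard only fires on the exact VLM + `max_length` + `keep_end` combination; text-only datasets and `max_length=None` pass through untouched.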

Testing enhancements:

  • Introduced a regression test (test_train_vlm_keep_end_raises) to verify that initializing training with truncation_mode='keep_end' for a vision-language model raises the expected error, preventing silent data corruption.
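The regression test's shape can be sketched as follows. `DPOTrainer` is replaced here with a stub (`FakeDPOTrainer`) because the real trainer needs a model, processor, and dataset; only the guard's observable behavior is reproduced:

```python
# Sketch of the regression test's structure; FakeDPOTrainer is a stand-in
# that reproduces only the init-time guard under test.
class FakeDPOTrainer:
    """Stub for trl.DPOTrainer exposing just the keep_end validation."""

    def __init__(self, is_vision_dataset=True, max_length=512,
                 truncation_mode="keep_start"):
        if is_vision_dataset and max_length is not None and truncation_mode == "keep_end":
            raise ValueError(
                "truncation_mode='keep_end' is not supported for vision-language models."
            )

def test_train_vlm_keep_end_raises():
    """Construction with keep_end must fail; the default keep_start must succeed."""
    try:
        FakeDPOTrainer(truncation_mode="keep_end")
    except ValueError as err:
        assert "keep_end" in str(err)
    else:
        raise AssertionError("expected ValueError for truncation_mode='keep_end'")
    FakeDPOTrainer(truncation_mode="keep_start")  # default mode still constructs
```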

Note

Low risk: adds a defensive init-time validation and a regression test; only affects the VLM + max_length + truncation_mode='keep_end' configuration by failing fast instead of proceeding.

Overview
Prevents silent corruption in DPO vision-language training by failing fast when a vision dataset is used with max_length set and truncation_mode='keep_end', raising a clear ValueError during DPOTrainer initialization.

Adds a regression test to ensure VLM trainer construction with keep_end truncation reliably errors (fix for #5285).

Written by Cursor Bugbot for commit f36b3c3.

@albertvillanova albertvillanova changed the title Prvent corruption of DPO VLM training if "keep_end" truncation_mode Prevent corruption of DPO VLM training if "keep_end" truncation_mode Mar 13, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

I'm not sure I understand this one. The image tokens are indeed at the beginning, but they are usually not the first tokens. E.g., if the data looks like

<user><img>What is it?<assistant>A flower<end>
<user><img>What is it?<assistant>A bus<end>

What prevents us from truncating the first token (<user>)?

@albertvillanova
Member Author

albertvillanova commented Mar 16, 2026

@qgallouedec, even in your very edge case, how could you be sure that you are just removing the first token (and not any image token) with "keep_end" in every example when they have different lengths? max_length is a single scalar applied uniformly to the whole dataset.

What this PR is trying to solve is slightly narrower: with keep_end, truncation removes a variable-length prefix of the sequence, so in a vision example it can drop the whole prompt prefix including the image placeholder/tokens. In your example, that could indeed mean dropping <user> first, then <img>, then "What is it?", depending on how much needs to be truncated.

The reason I called out image tokens explicitly is that, for VLM inputs, losing them is especially problematic: once the visual tokens are truncated away, the example is no longer a valid multimodal sample and can become semantically inconsistent with the remaining text. By contrast, truncating text-only prefixes is still undesirable, but it is the usual trade-off of sequence truncation and not something specific to vision inputs.

I could improve the wording of the error to make it more precise: the core argument is not that image tokens are necessarily the very first tokens, but that they live in the prefix region that keep_end is designed to discard.

- "truncation_mode='keep_end' is not supported for vision-language models. Image tokens reside in "
- "the prompt at the beginning of the sequence; keeping the end would drop them. Use "
- "truncation_mode='keep_start' (the default) or set max_length=None."
+ "truncation_mode='keep_end' is not supported for vision-language models. Image tokens reside "
+ "inside the prompt portion of the sequence; depending on the example, keep_end may silently "
+ "drop them, causing pixel_values to be forwarded to the model with no corresponding visual "
+ "tokens in input_ids. Use truncation_mode='keep_start' (the default) or set max_length=None."

keep_start does not have this problem: as long as max_length >= prompt_len, image tokens are always safe.
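The asymmetry is easy to see on a toy token list. This is a sketch, not TRL code: the ids, the `IMG` placeholder, and the `truncate` helper are all made up for illustration:

```python
# Toy illustration of why keep_end can drop image tokens while keep_start cannot,
# as long as max_length >= prompt length. IMG stands in for an image placeholder id.
IMG = 99

def truncate(ids, max_length, mode):
    """Truncate a token id list to max_length, keeping the start or the end."""
    if max_length is None or len(ids) <= max_length:
        return ids
    return ids[:max_length] if mode == "keep_start" else ids[-max_length:]

# Prompt (with image tokens) followed by completion text.
example = [1, IMG, IMG, IMG, 4, 5, 6, 7]

kept_end = truncate(example, 4, "keep_end")      # [4, 5, 6, 7] -> IMG silently dropped
kept_start = truncate(example, 4, "keep_start")  # [1, IMG, IMG, IMG] -> IMG preserved
```

With keep_end, `pixel_values` would still be forwarded to the model even though no corresponding visual tokens remain in `input_ids`, which is exactly the silent corruption the guard prevents.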


Development

Successfully merging this pull request may close these issues.

DPOTrainer silently corrupts VLM training with "keep_end" truncation_mode
