Prevent corruption of DPO VLM training if "keep_end" truncation_mode #5286
albertvillanova wants to merge 6 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I'm not sure I understand this one. The image tokens are indeed at the beginning, but they are usually not the first tokens. E.g. if the data looks like …, what prevents truncating the first token (…)?
@qgallouedec, even in your very edge case, how could you be sure that you are removing only the first token (and not any image token) with "keep_end" in every example, when the examples have different lengths? What this PR is trying to solve is slightly narrower: with …

The reason I called out image tokens explicitly is that, for VLM inputs, losing them is especially problematic: once the visual tokens are truncated away, the example is no longer a valid multimodal sample and can become semantically inconsistent with the remaining text. By contrast, truncating text-only prefixes is still undesirable, but it is the usual trade-off of sequence truncation and not something specific to vision inputs.

I could improve the wording of the error to make it more precise: the core argument is not that image tokens are necessarily the very first tokens, but that they live in the prefix region that "keep_end" truncates away.
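To make the failure mode concrete, here is a minimal sketch of the truncation behavior under discussion. The token ids and the IMAGE_TOKEN placeholder are hypothetical, not TRL's actual internals; the point is only that "keep_end" keeps the suffix and so silently discards prefix image tokens.

```python
# Hypothetical placeholder id standing in for one image patch token.
IMAGE_TOKEN = 32000


def truncate(ids, max_length, mode):
    """Keep at most max_length tokens, from the start or from the end."""
    if len(ids) <= max_length:
        return ids
    return ids[:max_length] if mode == "keep_start" else ids[-max_length:]


# A multimodal prompt: image tokens in the prefix, then the text tokens.
prompt = [IMAGE_TOKEN] * 4 + [101, 102, 103, 104, 105, 106]

kept_start = truncate(prompt, 6, "keep_start")
kept_end = truncate(prompt, 6, "keep_end")

# keep_start preserves every image token; keep_end drops all of them,
# leaving text that no longer corresponds to the pixel inputs.
print(kept_start.count(IMAGE_TOKEN))  # 4
print(kept_end.count(IMAGE_TOKEN))    # 0
```

With "keep_end" the image tokens vanish without any error, which is exactly the silent corruption the PR's init-time check is meant to surface.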
Prevent corruption of DPO VLM training if "keep_end" truncation_mode:
Fix #5285.
This PR addresses a regression related to vision-language models (VLMs) and sequence truncation. It ensures that using the "keep_end" truncation mode with VLMs raises a clear error at initialization, preventing silent corruption of training data. The update includes both a code fix and a regression test.

Changes

Validation improvements for vision-language models:
Updated the DPOTrainer.__init__ method to raise a ValueError if a vision-language dataset is used with truncation_mode='keep_end', explaining that image tokens would be dropped and recommending alternatives.

Testing enhancements:
Added a regression test (test_train_vlm_keep_end_raises) to verify that initializing training with truncation_mode='keep_end' for a vision-language model raises the expected error, preventing silent data corruption.

Note
Low Risk

Low risk: adds a defensive init-time validation and a regression test; only affects the VLM + max_length + truncation_mode='keep_end' configuration by failing fast instead of proceeding.

Overview

Prevents silent corruption in DPO vision-language training by failing fast when a vision dataset is used with max_length set and truncation_mode='keep_end', raising a clear ValueError during DPOTrainer initialization.

Adds a regression test to ensure VLM trainer construction with keep_end truncation reliably errors (fix for #5285).

Written by Cursor Bugbot for commit f36b3c3.
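The guard and its regression test can be sketched as follows. This is a hedged illustration of the pattern the PR describes, not TRL's actual DPOTrainer code: the class name, constructor parameters, and error message below are all invented for the sketch.

```python
# Illustrative fail-fast guard: raise at construction time rather than
# silently truncating image tokens during training.
class DPOTrainerSketch:
    def __init__(self, is_vision_model=False, max_length=None,
                 truncation_mode="keep_start"):
        if (is_vision_model and max_length is not None
                and truncation_mode == "keep_end"):
            raise ValueError(
                "truncation_mode='keep_end' is not supported for "
                "vision-language models: image tokens sit in the prefix of "
                "the sequence and would be truncated away, silently "
                "corrupting the training data. Use "
                "truncation_mode='keep_start' instead."
            )
        self.truncation_mode = truncation_mode


# Regression-style check mirroring the PR's test: construction must raise
# for the VLM + max_length + keep_end configuration.
try:
    DPOTrainerSketch(is_vision_model=True, max_length=512,
                     truncation_mode="keep_end")
except ValueError as err:
    print("raised as expected:", "keep_end" in str(err))  # True
```

Failing at __init__ time, before any batch is processed, is what turns a subtle data-corruption bug into an immediate, actionable configuration error.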