- [2026.03.15] 🖼️ Our poster is now available here. Feel free to check it out!
- [2026.02.16] 🎉 Code, data, and models are released!
- [2026.01.26] 🎉 Our Uni-DPO is accepted by ICLR 2026!
Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.
Key advantages:
- Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
- Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
- Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead.
- Dual-perspective dynamic weighting for preference optimization. Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization (a minimal weighting sketch follows this list).
- Quality-aware weighting filters ambiguous preference pairs. Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.
- Performance-aware weighting mitigates overfitting during training. High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.
- Decoupling data quality from learning difficulty. Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.
- State-of-the-art performance across text, math, and multimodal benchmarks. Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
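To make the dual weighting concrete, here is a minimal PyTorch sketch of how a quality-aware weight (from reward-model score margins) and a focal-style performance-aware weight (from the implicit DPO reward margin) could be combined to rescale the per-pair DPO loss. The functional forms and the hyperparameters `tau` and `gamma` are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def uni_dpo_weights(score_chosen, score_rejected,
                    logratio_chosen, logratio_rejected,
                    beta=0.1, gamma=2.0, tau=1.0):
    """Illustrative dual-perspective weights (not the paper's exact formulas).

    score_*:    external reward-model scores for the two responses.
    logratio_*: log pi_theta(y|x) - log pi_ref(y|x) for each response.
    """
    # Quality-aware weight: a larger score margin means a clearer,
    # more reliable preference pair.
    w_quality = torch.sigmoid((score_chosen - score_rejected) / tau)

    # Performance-aware weight (focal-style): p is the model's current
    # probability of ranking the pair correctly; well-fitted pairs
    # (p -> 1) are down-weighted, under-fitted pairs keep high weight.
    reward_margin = beta * (logratio_chosen - logratio_rejected)
    p = torch.sigmoid(reward_margin)
    w_perf = (1.0 - p) ** gamma

    return w_quality * w_perf

def weighted_dpo_loss(score_chosen, score_rejected,
                      logratio_chosen, logratio_rejected, beta=0.1):
    w = uni_dpo_weights(score_chosen, score_rejected,
                        logratio_chosen, logratio_rejected, beta=beta)
    margin = beta * (logratio_chosen - logratio_rejected)
    # Detach the weight so it acts as a rescaling coefficient rather
    # than an extra gradient path.
    return (w.detach() * -F.logsigmoid(margin)).mean()
```

Detaching the combined weight keeps gradients flowing only through the DPO margin, so the weighting reallocates focus without changing the direction of each pair's update.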
We present the 🤗 Uni-DPO Dataset, which contains preference pairs for training Uni-DPO across three key domains: textual understanding, mathematical reasoning, and multimodal understanding.
The 🤗 Textual folder contains the training data used for the Uni-DPO textual understanding experiments, including the data used in both the v0.1 and v0.2 settings. The exact mapping can be found in the training config folder. To generate the data yourself, refer to this document.
Process of generating data
- Download the 🤗 `HuggingFaceH4/ultrafeedback_binarized` dataset
- Run `decode.py` to generate policy outputs and clean them using `post_process.py`
- Run `reward_model_annotate.py` to obtain reward scores (a hedged sketch of the subsequent pair-building step follows below)
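For intuition, the sketch below shows how reward-annotated samples could then be turned into preference pairs while keeping the score margin for later quality-aware weighting. The field names (`prompt`, `responses`, `scores`) and the `min_margin` filter are hypothetical, not the repo's actual schema:

```python
def build_preference_pairs(annotated, min_margin=0.0):
    """Turn reward-annotated samples into (chosen, rejected) pairs."""
    pairs = []
    for ex in annotated:
        # Rank this prompt's sampled responses by reward-model score.
        ranked = sorted(zip(ex["responses"], ex["scores"]),
                        key=lambda rs: rs[1], reverse=True)
        (chosen, s_hi), (rejected, s_lo) = ranked[0], ranked[-1]
        if s_hi - s_lo > min_margin:  # drop ties and ambiguous pairs
            pairs.append({
                "prompt": ex["prompt"],
                "chosen": chosen,
                "rejected": rejected,
                "score_margin": s_hi - s_lo,  # reused for quality weighting
            })
    return pairs
```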
Training data for mathematical reasoning is located in the 🤗 Math folder.
If you need to generate this training data yourself, you can refer to this document and use this script.
Process of generating data
- Download the math question dataset 🤗 `RLHFlow/numia_prompt_dpo1`
- Run `gen_samples.py` to generate model responses
- Score with `verifiable_reward_labeling.py` and `progress_reward_labeling.py`
- Build preference pairs using `get_uni_dpo_data.py` (an illustrative ranking sketch follows below)
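As a rough illustration of how a verifiable (answer-correctness) reward and a process reward could be combined when ranking sampled solutions, consider the sketch below; the function, the field names, and the mixing weight `alpha` are hypothetical, not the actual script interface:

```python
def rank_math_samples(samples, answer, alpha=0.5):
    """Pick (chosen, rejected) solutions for one math question.

    samples: list of dicts with "solution", "pred" (final answer),
             and "process_score" (step-level reward); names are hypothetical.
    """
    def score(s):
        verifiable = 1.0 if s["pred"] == answer else 0.0  # 0/1 correctness
        return verifiable + alpha * s["process_score"]

    ranked = sorted(samples, key=score, reverse=True)
    return ranked[0]["solution"], ranked[-1]["solution"]
```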
Evaluation data are in 🤗 Math_eval_data.zip. See this document for evaluation details.
Training data for multimodal understanding are in the 🤗 Multimodal folder. See this document for details.
We release model weights trained with Uni-DPO under two versions: v0.1 and v0.2. The checkpoints cover multiple model families, including Llama-3-8B, Gemma-2-9B-IT, and Qwen2.5.
| Base Model | Training Data | Training Setup | Uni-DPO Model |
|---|---|---|---|
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.1 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO |
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.2 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO-v2-Qwen |
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.2 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO-v2-GPT-4 |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.1 | 🤗 Llama-3-8B-Instruct-Uni-DPO |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.2 | 🤗 Llama-3-8B-Instruct-Uni-DPO-v2-ArmoRM |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.2 | 🤗 Llama-3-8B-Instruct-Uni-DPO-v2-GPT-4o |
| 🤗 Gemma2-9B-IT | 🤗 | v0.1 | 🤗 Gemma2-9B-IT-Uni-DPO |
| 🤗 Qwen2.5-7B | 🤗 | v0.1 | 🤗 Qwen2.5-7B-Uni-DPO |
| 🤗 Qwen2.5-Math-7B | 🤗 | v0.1 | 🤗 Qwen2.5-Math-7B-Uni-DPO |
To ensure fair comparison with prior work, we align training and testing environments with the original implementations whenever possible. Below is a brief introduction to the environments used for each task.
Training environment: See this document for details.
- Built on the SimPO repository.
- Mainly relies on alignment-handbook and uses the `Trainer` class from the `transformers` library to construct a `UniDPOTrainer` class that implements Uni-DPO training (a minimal sketch follows below).
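Below is a minimal sketch of what such a `UniDPOTrainer` could look like. It assumes each batch interleaves chosen and rejected sequences and already carries precomputed reference log-probabilities (`ref_logps`) and reward-model score margins (`score_margin`); this is a simplified illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def sequence_logps(logits, labels, ignore_index=-100):
    """Sum of per-token log-probs of `labels` under `logits` (shifted)."""
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    mask = labels != ignore_index
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.masked_fill(~mask, 0).unsqueeze(-1))
    return (token_logps.squeeze(-1) * mask).sum(-1)

class UniDPOTrainer(Trainer):
    def __init__(self, *args, beta=0.1, gamma=2.0, tau=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.beta, self.gamma, self.tau = beta, gamma, tau

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        logits = model(input_ids=inputs["input_ids"],
                       attention_mask=inputs["attention_mask"]).logits
        logps = sequence_logps(logits, inputs["labels"])
        # Assumed batch layout: first half chosen, second half rejected.
        half = logps.shape[0] // 2
        pi_c, pi_r = logps[:half], logps[half:]
        ref_c, ref_r = inputs["ref_logps"][:half], inputs["ref_logps"][half:]

        margin = self.beta * ((pi_c - ref_c) - (pi_r - ref_r))
        w_quality = torch.sigmoid(inputs["score_margin"] / self.tau)
        w_perf = (1.0 - torch.sigmoid(margin)) ** self.gamma
        loss = ((w_quality * w_perf).detach() * -F.logsigmoid(margin)).mean()
        return (loss, logits) if return_outputs else loss
```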
For evaluation, the metrics reported in the main paper strictly align with the following four evaluation environments: Arena-Hard-Auto, AlpacaEval2, IFEval, SedarEval. For downstream task evaluation in the appendix, we use the configuration from Language Model Evaluation Harness.
Our training and evaluation environments are built on the Online-DPO-R1 repository. See this document for details.
- Training data construction: relies on vLLM for model deployment and inference (a short sampling sketch follows below)
- Training: also depends on alignment-handbook and uses the `Trainer` class from `transformers` to build the `UniDPOTrainer` class for Uni-DPO training
- Evaluation: the evaluation codebase is based on simpleRL-reason
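For reference, here is a minimal vLLM sampling snippet of the kind this step relies on; the model path and sampling hyperparameters are placeholders, not the repo's actual settings:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any HF-compatible model path works here.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")
params = SamplingParams(n=8, temperature=1.0, max_tokens=2048)

prompts = ["Prove that the sum of two even numbers is even."]
for request in llm.generate(prompts, params):
    candidates = [o.text for o in request.outputs]  # n samples per prompt
```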
We follow MM-RLHF. See this document for details.
- Training: our training environment is built on LlamaFactory, and we provide a minimally modified version along with the necessary training scripts.
- Evaluation: our evaluation environment is built on VLMEvalKit, and we provide the required evaluation scripts and documentation for running the evaluation.
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
@inproceedings{peng2026unidpo,
title = {Uni-{DPO}: A Unified Paradigm for Dynamic Preference Optimization of {LLM}s},
author = {Shangpin Peng and Weinong Wang and Zhuotao Tian and Senqiao Yang and Xing W and Haotian Xu and Chengquan Zhang and Takashi Isobe and Baotian Hu and Min Zhang},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
url = {https://openreview.net/forum?id=G7DBGlgjjp}
}

If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.
We thank the following projects for their open-source code and datasets, which greatly facilitated our research:
- Training data generation: ultrafeedback_binarized, RLHFlow/numia_prompt_dpo1, MM-RLHF
- Training: SimPO, alignment-handbook, Online-DPO-R1, LlamaFactory
- Evaluation
- Textual understanding: Arena-Hard-Auto, AlpacaEval2, IFEval, SedarEval, Language Model Evaluation Harness
- Math reasoning: simpleRL-reason
- Multimodal understanding: VLMEvalKit