- [2026.03.15] 🖼️ Our poster is now available here. Feel free to check it out!
- [2026.02.16] 🎉 Code, data, and models are released!
- [2026.01.26] 🎉 Our Uni-DPO is accepted by ICLR 2026!
Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.
Key advantages:
- Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
- Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
- Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead.
- Dual-perspective dynamic weighting for preference optimization. Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization (a minimal weighting sketch follows this list).
- Quality-aware weighting filters ambiguous preference pairs. Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.
- Performance-aware weighting mitigates overfitting during training. High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.
- Decoupling data quality from learning difficulty. Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.
- State-of-the-art performance across text, math, and multimodal benchmarks. Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
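To make the dual weighting concrete, here is a minimal PyTorch sketch of how a quality-aware weight (from reward-model score margins) and a focal-style performance-aware weight (from the implicit DPO reward margin) could be combined to rescale the per-pair DPO loss. The functional forms and the hyperparameters `tau` and `gamma` are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def uni_dpo_weights(score_chosen, score_rejected,
                    logratio_chosen, logratio_rejected,
                    beta=0.1, gamma=2.0, tau=1.0):
    """Illustrative dual-perspective weights (not the paper's exact formulas).

    score_*:    external reward-model scores for the two responses.
    logratio_*: log pi_theta(y|x) - log pi_ref(y|x) for each response.
    """
    # Quality-aware weight: a larger score margin means a clearer,
    # more reliable preference pair.
    w_quality = torch.sigmoid((score_chosen - score_rejected) / tau)

    # Performance-aware weight (focal-style): p is the model's current
    # probability of ranking the pair correctly; well-fitted pairs
    # (p -> 1) are down-weighted, under-fitted pairs keep high weight.
    reward_margin = beta * (logratio_chosen - logratio_rejected)
    p = torch.sigmoid(reward_margin)
    w_perf = (1.0 - p) ** gamma

    return w_quality * w_perf

def weighted_dpo_loss(score_chosen, score_rejected,
                      logratio_chosen, logratio_rejected, beta=0.1):
    w = uni_dpo_weights(score_chosen, score_rejected,
                        logratio_chosen, logratio_rejected, beta=beta)
    margin = beta * (logratio_chosen - logratio_rejected)
    # Detach the weight so it acts as a rescaling coefficient rather
    # than an extra gradient path.
    return (w.detach() * -F.logsigmoid(margin)).mean()
```

Detaching the combined weight keeps gradients flowing only through the DPO margin, so the weighting reallocates focus without changing the direction of each pair's update.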
We present the 🤗 Uni-DPO Dataset, which contains preference pairs for training Uni-DPO across three key domains: textual understanding, mathematical reasoning, and multimodal understanding.
The 🤗 Textual folder contains the training data used for the Uni-DPO textual understanding experiments, including the data used in both the v0.1 and v0.2 settings. The exact mapping can be found in the training config folder. To generate the data yourself, refer to this document.
Process of generating data
- Download the 🤗 `HuggingFaceH4/ultrafeedback_binarized` dataset
- Run `decode.py` to generate policy outputs and clean them using `post_process.py`
- Run `reward_model_annotate.py` to obtain reward scores (a hedged sketch of the subsequent pair-building step follows below)
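For intuition, the sketch below shows how reward-annotated samples could then be turned into preference pairs while keeping the score margin for later quality-aware weighting. The field names (`prompt`, `responses`, `scores`) and the `min_margin` filter are hypothetical, not the repo's actual schema:

```python
def build_preference_pairs(annotated, min_margin=0.0):
    """Turn reward-annotated samples into (chosen, rejected) pairs."""
    pairs = []
    for ex in annotated:
        # Rank this prompt's sampled responses by reward-model score.
        ranked = sorted(zip(ex["responses"], ex["scores"]),
                        key=lambda rs: rs[1], reverse=True)
        (chosen, s_hi), (rejected, s_lo) = ranked[0], ranked[-1]
        if s_hi - s_lo > min_margin:  # drop ties and ambiguous pairs
            pairs.append({
                "prompt": ex["prompt"],
                "chosen": chosen,
                "rejected": rejected,
                "score_margin": s_hi - s_lo,  # reused for quality weighting
            })
    return pairs
```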
Training data for mathematical reasoning is located in the 🤗 Math folder.
If you need to generate this training data yourself, you can refer to this document and use this script.
Process of generating data
- Download the math question dataset 🤗 `RLHFlow/numia_prompt_dpo1`
- Run `gen_samples.py` to generate model responses
- Score with `verifiable_reward_labeling.py` and `progress_reward_labeling.py`
- Build preference pairs using `get_uni_dpo_data.py` (an illustrative ranking sketch follows below)
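As a rough illustration of how a verifiable (answer-correctness) reward and a process reward could be combined when ranking sampled solutions, consider the sketch below; the function, the field names, and the mixing weight `alpha` are hypothetical, not the actual script interface:

```python
def rank_math_samples(samples, answer, alpha=0.5):
    """Pick (chosen, rejected) solutions for one math question.

    samples: list of dicts with "solution", "pred" (final answer),
             and "process_score" (step-level reward); names are hypothetical.
    """
    def score(s):
        verifiable = 1.0 if s["pred"] == answer else 0.0  # 0/1 correctness
        return verifiable + alpha * s["process_score"]

    ranked = sorted(samples, key=score, reverse=True)
    return ranked[0]["solution"], ranked[-1]["solution"]
```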
Evaluation data are in 🤗 Math_eval_data.zip. See this document for evaluation details.
Training data for multimodal understanding are in the 🤗 Multimodal folder. See this document for details.
We release model weights trained with Uni-DPO under two versions: v0.1 and v0.2. The checkpoints cover multiple model families, including Llama-3-8B, Gemma-2-9B-IT, and Qwen2.5.
| Base Model | Training Data | Training Setup | Uni-DPO Model |
|---|---|---|---|
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.1 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO |
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.2 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO-v2-Qwen |
| 🤗 Llama-3-8B-Base-SFT | 🤗 | v0.2 | 🤗 Llama-3-8B-Base-SFT-Uni-DPO-v2-GPT-4 |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.1 | 🤗 Llama-3-8B-Instruct-Uni-DPO |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.2 | 🤗 Llama-3-8B-Instruct-Uni-DPO-v2-ArmoRM |
| 🤗 Llama-3-8B-Instruct | 🤗 | v0.2 | 🤗 Llama-3-8B-Instruct-Uni-DPO-v2-GPT-4o |
| 🤗 Gemma2-9B-IT | 🤗 | v0.1 | 🤗 Gemma2-9B-IT-Uni-DPO |
| 🤗 Qwen2.5-7B | 🤗 | v0.1 | 🤗 Qwen2.5-7B-Uni-DPO |
| 🤗 Qwen2.5-Math-7B | 🤗 | v0.1 | 🤗 Qwen2.5-Math-7B-Uni-DPO |
To ensure fair comparison with prior work, we align training and testing environments with the original implementations whenever possible. Below is a brief introduction to the environments used for each task.
Training environment: See this document for details.
- Built on the SimPO repository.
- Mainly relies on alignment-handbook and uses the `Trainer` class from the `transformers` library to construct a `UniDPOTrainer` class that implements Uni-DPO training (a minimal sketch follows below).
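Below is a minimal sketch of what such a `UniDPOTrainer` could look like. It assumes each batch interleaves chosen and rejected sequences and already carries precomputed reference log-probabilities (`ref_logps`) and reward-model score margins (`score_margin`); this is a simplified illustration, not the released implementation:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

def sequence_logps(logits, labels, ignore_index=-100):
    """Sum of per-token log-probs of `labels` under `logits` (shifted)."""
    logits, labels = logits[:, :-1, :], labels[:, 1:]
    mask = labels != ignore_index
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.masked_fill(~mask, 0).unsqueeze(-1))
    return (token_logps.squeeze(-1) * mask).sum(-1)

class UniDPOTrainer(Trainer):
    def __init__(self, *args, beta=0.1, gamma=2.0, tau=1.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.beta, self.gamma, self.tau = beta, gamma, tau

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        logits = model(input_ids=inputs["input_ids"],
                       attention_mask=inputs["attention_mask"]).logits
        logps = sequence_logps(logits, inputs["labels"])
        # Assumed batch layout: first half chosen, second half rejected.
        half = logps.shape[0] // 2
        pi_c, pi_r = logps[:half], logps[half:]
        ref_c, ref_r = inputs["ref_logps"][:half], inputs["ref_logps"][half:]

        margin = self.beta * ((pi_c - ref_c) - (pi_r - ref_r))
        w_quality = torch.sigmoid(inputs["score_margin"] / self.tau)
        w_perf = (1.0 - torch.sigmoid(margin)) ** self.gamma
        loss = ((w_quality * w_perf).detach() * -F.logsigmoid(margin)).mean()
        return (loss, logits) if return_outputs else loss
```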
For evaluation, the metrics reported in the main paper strictly align with the following four evaluation environments: Arena-Hard-Auto, AlpacaEval2, IFEval, SedarEval. For downstream task evaluation in the appendix, we use the configuration from Language Model Evaluation Harness.
Our training and evaluation environments are built on the Online-DPO-R1 repository. See this document for details.
- Training data construction: relies on vLLM for model deployment and inference (a short sampling sketch follows below)
- Training: also depends on alignment-handbook and uses the `Trainer` class from `transformers` to build the `UniDPOTrainer` class for Uni-DPO training
- Evaluation: the evaluation codebase is based on simpleRL-reason
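For reference, here is a minimal vLLM sampling snippet of the kind this step relies on; the model path and sampling hyperparameters are placeholders, not the repo's actual settings:

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; any HF-compatible model path works here.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")
params = SamplingParams(n=8, temperature=1.0, max_tokens=2048)

prompts = ["Prove that the sum of two even numbers is even."]
for request in llm.generate(prompts, params):
    candidates = [o.text for o in request.outputs]  # n samples per prompt
```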
We follow MM-RLHF. See this document for details.
- Training: our training environment is built on LlamaFactory, and we provide a minimally modified version along with the necessary training scripts.
- Evaluation: our evaluation environment is built on VLMEvalKit, and we provide the required evaluation scripts and documentation for running the evaluation.
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
@inproceedings{peng2026unidpo,
title = {Uni-{DPO}: A Unified Paradigm for Dynamic Preference Optimization of {LLM}s},
author = {Shangpin Peng and Weinong Wang and Zhuotao Tian and Senqiao Yang and Xing W and Haotian Xu and Chengquan Zhang and Takashi Isobe and Baotian Hu and Min Zhang},
booktitle = {The Fourteenth International Conference on Learning Representations},
year = {2026},
url = {https://openreview.net/forum?id=G7DBGlgjjp}
}

If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.
We thank the following projects for their open-source code and datasets, which greatly facilitated our research:
- Training data generation: ultrafeedback_binarized, RLHFlow/numia_prompt_dpo1, MM-RLHF
- Training: SimPO, alignment-handbook, Online-DPO-R1, LlamaFactory
- Evaluation
- Textual understanding: Arena-Hard-Auto, AlpacaEval2, IFEval, SedarEval, Language Model Evaluation Harness
- Math reasoning: simpleRL-reason
- Multimodal understanding: VLMEvalKit