
Uni-DPO: A Unified Paradigm for
Dynamic Preference Optimization of LLMs

中文 | English

🎊 News

  • [2026.03.15] 🖼️ Our poster is now available here. Feel free to check it out!
  • [2026.02.16] 📖 Code, data, and models are released!
  • [2026.01.26] 🎉 Our Uni-DPO is accepted by ICLR 2026!

🚀 Overview

Uni-DPO introduces a unified dynamic preference optimization paradigm for training large language models (LLMs) from preference data. Unlike prior DPO-based methods that treat all preference pairs equally, Uni-DPO jointly considers intrinsic data quality and model learning dynamics, enabling more effective and robust preference learning.

Key advantages:

  • Quality-aware: Adaptively prioritizes high-quality preference pairs while down-weighting ambiguous ones.
  • Dynamics-aware: Shifts training focus toward under-fitted samples to mitigate overfitting.
  • Unified & lightweight: Seamlessly integrates dual-perspective weighting and calibrated NLL into standard DPO with minimal overhead.

📌 Contents

🔑 Key Features

  • Dual-perspective dynamic weighting for preference optimization. Uni-DPO jointly models what data is worth learning (intrinsic quality) and what the model still struggles with (learning dynamics). By combining a quality-aware weight and a performance-aware weight, Uni-DPO dynamically reallocates training focus throughout optimization.

  • Quality-aware weighting filters ambiguous preference pairs. Preference data varies widely in reliability. Uni-DPO leverages score margins between preferred and rejected responses to assign higher weights to clear, high-quality pairs while suppressing noisy or ambiguous ones.

  • Performance-aware weighting mitigates overfitting during training. High-quality samples are not always the most informative once the model has already mastered them. Uni-DPO introduces a stabilized focal-style performance weight that down-weights well-fitted pairs and emphasizes hard-but-informative examples, effectively reducing overfitting.

  • Decoupling data quality from learning difficulty. Empirical analysis reveals that data quality (score margin) and learning difficulty (reward margin) are weakly correlated. Uni-DPO explicitly models this mismatch, ensuring that optimization is guided by both dimensions rather than relying on either alone.

  • State-of-the-art performance across text, math, and multimodal benchmarks. Uni-DPO consistently outperforms DPO and SimPO across diverse settings.
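The README does not reproduce the paper's weighting formulas. As a rough sketch of the dual-perspective idea only (the sigmoid and focal functional forms, names, and hyperparameters below are our assumptions, not the official definitions), a quality term derived from the annotator score margin can be combined with a focal-style term derived from the model's implicit reward margin:

```python
import math

def quality_weight(score_margin: float, tau: float = 1.0) -> float:
    # Quality-aware term: a larger score margin between the preferred and
    # rejected responses means a clearer pair, so it gets a higher weight.
    return 1.0 / (1.0 + math.exp(-score_margin / tau))

def performance_weight(reward_margin: float, gamma: float = 2.0) -> float:
    # Performance-aware, focal-style term: pairs the policy already fits
    # well (large implicit reward margin) are down-weighted.
    p = 1.0 / (1.0 + math.exp(-reward_margin))  # "how solved" the pair is
    return (1.0 - p) ** gamma

def uni_dpo_weight(score_margin: float, reward_margin: float) -> float:
    # Dual-perspective weight: data quality times learning dynamics.
    return quality_weight(score_margin) * performance_weight(reward_margin)
```

Because the two margins are only weakly correlated (see the decoupling point above), multiplying the two terms lets either dimension veto a pair: ambiguous data and already-mastered data both receive small weights.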

📚 Dataset

We present the 🤗 🤖 Uni-DPO Dataset, which contains preference pairs for training Uni-DPO across three key domains: textual understanding, mathematical reasoning, and multimodal understanding.

Textual Understanding

The 🤗 Textual folder contains the training data used in our textual-understanding experiments, covering both the v0.1 and v0.2 settings. The exact mapping can be found in the training config folder. To generate this data yourself, refer to this document.

Process of generating the data:
  1. Download the 🤗 HuggingFaceH4/ultrafeedback_binarized dataset.
  2. Run decode.py to generate policy outputs, then clean them with post_process.py.
  3. Run reward_model_annotate.py to obtain reward scores.
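The scripts above are not reproduced here. As a minimal illustration of how step 3's reward-annotated outputs can be turned into chosen/rejected pairs (the field names and the top-vs-bottom selection rule are assumptions for illustration, not the actual script logic):

```python
def build_preference_pairs(samples):
    """Turn reward-annotated samples into chosen/rejected preference pairs.

    samples: list of {"prompt": str, "responses": [(text, reward_score), ...]}
    The highest-scored response becomes "chosen" and the lowest "rejected";
    the score margin can later feed a quality-aware weight.
    """
    pairs = []
    for ex in samples:
        ranked = sorted(ex["responses"], key=lambda r: r[1], reverse=True)
        (chosen_text, chosen_score) = ranked[0]
        (rej_text, rej_score) = ranked[-1]
        pairs.append({
            "prompt": ex["prompt"],
            "chosen": chosen_text,
            "rejected": rej_text,
            "score_margin": chosen_score - rej_score,
        })
    return pairs
```

Keeping the score margin alongside each pair is what allows the quality-aware weighting to run without re-querying the reward model at training time.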

Mathematical Reasoning

Training data for mathematical reasoning is located in the 🤗 Math folder.

If you need to generate this training data yourself, you can refer to this document and use this script.

Process of generating the data:
  1. Download the math question dataset 🤗 RLHFlow/numia_prompt_dpo1.
  2. Run gen_samples.py to generate model responses.
  3. Score the responses with verifiable_reward_labeling.py and progress_reward_labeling.py.
  4. Build preference pairs with get_uni_dpo_data.py.
Evaluation data are in 🤗 Math_eval_data.zip. See this document for evaluation details.

Multimodal Understanding

Training data for multimodal understanding are in the 🤗 Multimodal folder. See this document for details.

📦 Model Weights

We release model weights trained with Uni-DPO under two versions: v0.1 and v0.2. The checkpoints cover multiple model families, including Llama3-8B, Gemma-2-9B-IT, and Qwen2.5.

| Base Model | Training Data / Setup | Uni-DPO Model |
| --- | --- | --- |
| 🤗 Llama-3-8B-Base-SFT | 🤗 v0.1 | 🤗 🤖 Llama-3-8B-Base-SFT-Uni-DPO |
| 🤗 Llama-3-8B-Base-SFT | 🤗 v0.2 | 🤗 🤖 Llama-3-8B-Base-SFT-Uni-DPO-v2-Qwen |
| 🤗 Llama-3-8B-Base-SFT | 🤗 v0.2 | 🤗 🤖 Llama-3-8B-Base-SFT-Uni-DPO-v2-GPT-4 |
| 🤗 Llama-3-8B-Instruct | 🤗 v0.1 | 🤗 🤖 Llama-3-8B-Instruct-Uni-DPO |
| 🤗 Llama-3-8B-Instruct | 🤗 v0.2 | 🤗 🤖 Llama-3-8B-Instruct-Uni-DPO-v2-ArmoRM |
| 🤗 Llama-3-8B-Instruct | 🤗 v0.2 | 🤗 🤖 Llama-3-8B-Instruct-Uni-DPO-v2-GPT-4o |
| 🤗 Gemma2-9B-IT | 🤗 v0.1 | 🤗 🤖 Gemma2-9B-IT-Uni-DPO |
| 🤗 Qwen2.5-7B | 🤗 v0.1 | 🤗 🤖 Qwen2.5-7B-Uni-DPO |
| 🤗 Qwen2.5-Math-7B | 🤗 v0.1 | 🤗 🤖 Qwen2.5-Math-7B-Uni-DPO |

💻 Environment Setup

To ensure fair comparison with prior work, we align training and testing environments with the original implementations whenever possible. Below is a brief introduction to the environments used for each task.

Textual Understanding

Training environment: See this document for details.

  • Built on the SimPO repository.
  • Relies mainly on alignment-handbook and uses the Trainer class from the transformers library to implement a UniDPOTrainer class for Uni-DPO training.
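The UniDPOTrainer code itself lives in the repository; as a rough, dependency-free illustration of the kind of loss such a trainer computes (the `dpo_pair_loss` form is the standard DPO objective, but the reweighting scheme and names here are our assumptions, not the official implementation):

```python
import math

def dpo_pair_loss(reward_margin: float, beta: float = 0.1) -> float:
    # Standard per-pair DPO loss: -log sigmoid(beta * reward margin),
    # where reward_margin is the chosen-minus-rejected log-ratio difference.
    return -math.log(1.0 / (1.0 + math.exp(-beta * reward_margin)))

def weighted_batch_loss(reward_margins, weights, beta: float = 0.1) -> float:
    # A UniDPOTrainer-style reduction: each pair's DPO loss is scaled by
    # its dual-perspective weight, then averaged over the batch.
    losses = [dpo_pair_loss(m, beta) for m in reward_margins]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

In a real Trainer subclass this reduction would sit inside a compute_loss override, with the margins and weights computed from the policy and reference model in the same forward pass.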

For evaluation, the metrics reported in the main paper are obtained with four evaluation environments: Arena-Hard-Auto, AlpacaEval 2, IFEval, and SedarEval. For the downstream-task evaluation in the appendix, we use the configuration from the Language Model Evaluation Harness.

Mathematical Reasoning

Our training and evaluation environments are built based on the Online-DPO-R1 repository. See this document for details.

  • Training data construction: relies on vLLM for model deployment and inference.
  • Training: also depends on alignment-handbook and uses the Trainer class from transformers to build the UniDPOTrainer class for Uni-DPO training.
  • Evaluation: the evaluation codebase is based on simpleRL-reason.

Multimodal Understanding

Our setup follows MM-RLHF. See this document for details.

  • Training: our training environment is built on LlamaFactory; we provide a minimally modified version and the necessary training scripts.
  • Evaluation: our evaluation environment is built on VLMEvalKit; we provide the required evaluation scripts and documentation for running the evaluation.

πŸ“ Citation

If you find our model, code, data, or paper helpful, please consider citing our paper 📝 and starring us ⭐️!

@inproceedings{peng2026unidpo,
  title     = {Uni-{DPO}: A Unified Paradigm for Dynamic Preference Optimization of {LLM}s},
  author    = {Shangpin Peng and Weinong Wang and Zhuotao Tian and Senqiao Yang and Xing W and Haotian Xu and Chengquan Zhang and Takashi Isobe and Baotian Hu and Min Zhang},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=G7DBGlgjjp}
}

📧 Contact us

If you have any questions, comments, or suggestions, please do not hesitate to submit an issue or PR to help advance research in this area.

πŸ™ Acknowledgement

We thank the following projects for their open-source code and datasets, which greatly facilitated our research:

License

Apache License 2.0
