🎚️ $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)

KL Penalty Control via Perturbation for Direct Preference Optimization
Sangkyu Lee¹*, Janghoon Han², Hosung Song², Stanley Jungkyu Choi², Honglak Lee²,³, Youngjae Yu⁴
¹Yonsei University, ²LG AI Research, ³University of Michigan, Ann Arbor, ⁴Seoul National University
*Work done during internship at LG AI Research.

This is the official repository of "KL Penalty Control via Perturbation for Direct Preference Optimization". It provides:

EpsilonDPOTrainer and EpsilonDPOConfig for $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)
Example training scripts for Mistral-Instruct and Llama-3-Instruct

Installation

EpsilonDPOTrainer and EpsilonDPOConfig are implemented on top of the DPOTrainer and DPOConfig of trl==0.13.0, so they should work in any environment compatible with that version. To reproduce our environment, set up Python 3.10 and then install the dependencies:

pip install -r requirements.txt

If you want to use FlashAttention 2 with the included training scripts, you also need to install flash-attn:

pip install flash-attn --no-build-isolation
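Before launching a run, it can help to confirm that the environment matches the one described above. The following is a minimal sketch, not part of this repository: `check_environment` is a hypothetical helper name, and it assumes the packages are importable as `trl` and `flash_attn`. It only reports what it finds and does not install anything.

```python
import importlib.metadata
import importlib.util
import sys


def check_environment(expected_trl="0.13.0", expected_python=(3, 10)):
    """Compare the current environment with the versions named in this README.

    Hypothetical helper for illustration; it is not shipped with the repository.
    """
    try:
        trl_version = importlib.metadata.version("trl")
    except importlib.metadata.PackageNotFoundError:
        trl_version = None  # trl is not installed in this environment

    return {
        # True when the interpreter is the expected major.minor (3.10 by default)
        "python_matches": sys.version_info[:2] == expected_python,
        "trl_version": trl_version,
        "trl_matches": trl_version == expected_trl,
        # flash-attn is optional; it is only needed for FlashAttention 2
        "flash_attn_available": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A mismatch does not necessarily mean training will fail; it only indicates that the setup differs from the one this README describes.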

Usage

EpsilonDPOTrainer shares its arguments with DPOTrainer, so it can be used in the same way:

from config import EpsilonDPOConfig
from trainer import EpsilonDPOTrainer

...

args = EpsilonDPOConfig(**args)
trainer = EpsilonDPOTrainer(model=model,
                            ref_model=ref_model,
                            args=args,
                            processing_class=processing_class,
                            train_dataset=train_dataset,
                            eval_dataset=eval_dataset)
trainer.train()

Here, EpsilonDPOConfig takes one argument in addition to those of DPOConfig:

epsilon: float=0.01; Parameter controlling the step size of KL penalty relaxation.
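For context, the sketch below shows the standard sigmoid DPO loss for a single preference pair, where `beta` (an existing DPOConfig argument) scales the implicit reward and acts as the KL penalty strength. This is only an illustration of the quantity being controlled, under the assumption that the KL penalty in the title refers to this `beta`. It is not the $\varepsilon$-DPO update rule, which is implemented in trainer.py, and `dpo_loss` is a hypothetical function name.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for one (chosen, rejected) pair.

    Illustrative only: plain DPO, not the epsilon-DPO procedure.
    Inputs are sequence log-probabilities under the policy and reference models.
    """
    # Log-ratios of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # beta scales the margin; it is the KL penalty coefficient in DPO
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) written as a numerically stable softplus(-x)
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Same log-probabilities, two different KL penalty strengths
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1))
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.5))
```

With a positive margin between chosen and rejected responses, a larger `beta` yields a smaller loss for the same log-probabilities, which is why the choice of `beta` affects how strongly each pair drives the update.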

The included example training scripts can be run as follows:

# Mistral-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/mistral_instruct.yaml

# Llama-3-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/llama3_instruct.yaml

If you want to enable FlashAttention 2, uncomment the attn_implementation: "flash_attention_2" line in configs/mistral_instruct.yaml and configs/llama3_instruct.yaml.

Citation

@article{lee2025kl,
  title={KL Penalty Control via Perturbation for Direct Preference Optimization},
  author={Lee, Sangkyu and Han, Janghoon and Song, Hosung and Choi, Stanley Jungkyu and Lee, Honglak and Yu, Youngjae},
  journal={arXiv preprint arXiv:2502.13177},
  year={2025}
}
